Merge pull request #3900 from AI-Hypercomputer:update_maxtext_version

Google-ML-Automation · Google-ML-Automation · commit 2cbf7fd63e4d · 2026-05-13T17:26:03.000-07:00
PiperOrigin-RevId: 915142180
diff --git a/docs/build_maxtext.md b/docs/build_maxtext.md
@@ -65,7 +65,7 @@ source ${VENV_NAME?}/bin/activate
 # This enables Docker image building and workload scheduling via XPK.
 # Once installed, you will have access to the `build_maxtext_docker_image`
 # and `upload_maxtext_docker_image` commands.
-uv pip install maxtext[runner]==0.2.1 --resolution=lowest
+uv pip install maxtext[runner]=={{version}} --resolution=lowest
 ```
 
 > **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
@@ -78,25 +78,7 @@ If you plan to contribute to MaxText or need the latest unreleased features, ins
 # Clone the repository
 git clone https://github.com/AI-Hypercomputer/maxtext.git
 cd maxtext
-```
-
-:::\{only} is_not_latest
-
-By default, cloning the repository provides the latest version (**HEAD**).
-If you wish to use the latest features, please follow the [latest guide](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
-If you want to ensure compatibility with the specific version of the documentation
-you are currently viewing, you must checkout the corresponding tag for that version
-before proceeding with the installation.
-
-```{eval-rst}
-.. parsed-literal::
 
-  git checkout |version|
-```
-
-:::
-
-```bash
 # Create virtual environment
 export VENV_NAME=<VENV_NAME> # e.g., docker_venv
 uv venv --python 3.12 --seed ${VENV_NAME?}
diff --git a/docs/conf.py b/docs/conf.py
@@ -27,6 +27,7 @@
 
 import os
 import os.path
+import re
 import sys
 import logging
 from sphinx.util import logging as sphinx_logging
@@ -42,7 +43,15 @@
 # pylint: disable=redefined-builtin
 copyright = "2023–2026, Google LLC"
 author = "MaxText developers"
-version = os.environ.get("READTHEDOCS_VERSION", "latest")
+
+# Get version from the __init__.py file
+init_path = os.path.abspath(os.path.join(MAXTEXT_REPO_ROOT, "src", "maxtext", "__init__.py"))
+with open(init_path, "r", encoding="utf-8") as f:
+  match = re.search(r"^__version__ = ['\"]([^'\"]*)['\"]", f.read(), re.MULTILINE)
+  if match:
+    version = match.group(1)
+  else:
+    raise RuntimeError("Unable to find version string.")
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -247,6 +256,12 @@ def filter(self, record: logging.LogRecord) -> bool:
     return not msg.strip().startswith(filter_out)
 
 
+def substitute_placeholders(app, docname, source):
+  result = source[0]
+  result = result.replace("{{version}}", version)
+  source[0] = result
+
+
 def setup(app):
   """Set up the Sphinx application with custom behavior."""
 
@@ -259,5 +274,4 @@ def setup(app):
   warning_handler, *_ = [h for h in logger.handlers if isinstance(h, sphinx_logging.WarningStreamHandler)]
   warning_handler.filters.insert(0, FilterSphinxWarnings(app))
 
-  if version != "latest":
-    app.tags.add("is_not_latest")
+  app.connect("source-read", substitute_placeholders)
diff --git a/docs/guides/checkpointing_solutions/convert_checkpoint.md b/docs/guides/checkpointing_solutions/convert_checkpoint.md
@@ -23,7 +23,7 @@ The following models are supported:
 
 ## Prerequisites
 
-- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html).
+- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](../../install_maxtext.md).
 - Hugging Face model checkpoints are cached locally at `$HOME/.cache/huggingface/hub` before conversion. Ensure you have sufficient disk space.
 - Authenticate via the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/v0.21.2/guides/cli) if using private or gated models.
 
@@ -71,7 +71,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
 ### Key Parameters
 
 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
@@ -118,7 +118,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \
 
 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
 - `load_parameters_path`: The path to the MaxText Orbax checkpoint.
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
 - `base_output_directory`: The path where the converted checkpoint will be stored; it can be Google Cloud Storage (GCS), Hugging Face Hub or local.
@@ -128,7 +128,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \
 
 To ensure the conversion was successful, you can use the [test script](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py). It runs a forward pass on both the original and converted models and compares the output logits to verify conversion. It is used to verify the bidirectional conversion.
 
-> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html#from-source).
+> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](../../install_maxtext.md#from-source).
 
 ### Setup Environment
 
@@ -159,7 +159,7 @@ python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \
 
 - `load_parameters_path`: The path to the MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
 - `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used.
 - `--run_hf_model` (Optional): Indicates if loading Hugging Face model from the hf_model_path. If not set, it will compare the maxtext logits with pre-saved golden logits.
 - `--hf_model_path` (Optional): The path to the Hugging Face checkpoint (if `--run_hf_model=True`).
diff --git a/docs/guides/optimization/custom_model.md b/docs/guides/optimization/custom_model.md
@@ -254,7 +254,7 @@ Ironwood over ICI:
 - `3 * M * 8 / 2 > 12800`
 - `M > 1100`
 
-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](../optimization/sharding.md) for specific challenges regarding PP + FSDP/DP.
 
 ## Step 4. Analyze experiments
 
diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md
@@ -1,5 +1,5 @@
 <!--
- Copyright 2023-2025 Google LLC
+ Copyright 2023-2026 Google LLC
 
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
@@ -51,22 +51,22 @@ This is the easiest way to get started with the latest stable version.
      TPUs.
 
      ```bash
-     uv pip install maxtext[tpu]==0.2.1 --resolution=lowest
+     uv pip install maxtext[tpu]=={{version}} --resolution=lowest
      ```
 
    - **Option 2:** Install `maxtext[cuda12]`, used for pre-training and decoding
      on GPUs.
 
      ```bash
-     uv pip install maxtext[cuda12]==0.2.1 --resolution=lowest
+     uv pip install maxtext[cuda12]=={{version}} --resolution=lowest
      ```
 
    - **Option 3:** Install `maxtext[tpu-post-train]`, used for post-training on
      TPUs. Currently, this option should also be used for running `vllm_decode`
      on TPUs.
 
      ```bash
-     uv pip install maxtext[tpu-post-train]==0.2.1 --resolution=lowest
+     uv pip install maxtext[tpu-post-train]=={{version}} --resolution=lowest
      ```
 
    - **Option 4:** Install `maxtext[runner]`, used for building MaxText's Docker
@@ -78,7 +78,7 @@ This is the easiest way to get started with the latest stable version.
      guide.
 
      ```bash
-     uv pip install maxtext[runner]==0.2.1 --resolution=lowest
+     uv pip install maxtext[runner]=={{version}} --resolution=lowest
      ```
 
 ```{note}
@@ -112,22 +112,6 @@ environment to avoid dependency conflicts.
    cd maxtext
    ```
 
-:::\{only} is_not_latest
-
-By default, cloning the repository provides the latest version (**HEAD**).
-If you wish to use the latest features, please follow the [latest guide](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
-If you want to ensure compatibility with the specific version of the documentation
-you are currently viewing, you must checkout the corresponding tag for that version
-before proceeding with the installation.
-
-```{eval-rst}
-.. parsed-literal::
-
-  git checkout |version|
-```
-
-:::
-
 2. Create virtual environment:
 
    ```bash
diff --git a/docs/reference/architecture/jax_ai_libraries_chosen.md b/docs/reference/architecture/jax_ai_libraries_chosen.md
@@ -60,7 +60,7 @@ For more information on using Orbax, please refer to https://github.com/google/o
 
 Its APIs are explicitly designed for the multi-host paradigm, simplifying the process of ensuring that each host loads a unique shard of the global batch.
 
-For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located at https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline/data_input_grain.html
+For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located [here](../../guides/data_input_pipeline/data_input_grain.md).
 
 ## Qwix: For native JAX quantization
 
diff --git a/docs/reference/core_concepts/batch_size.md b/docs/reference/core_concepts/batch_size.md
@@ -34,11 +34,11 @@ You can set `per_device_batch_size` and `gradient_accumulation_steps` in `config
 
 `global_batch_to_load` = `global_batch_size_to_train_on x expansion_factor_real_data`
 
-When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job. Details in https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/data_input_pipeline/data_input_grain.html#using-grain.
+When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job. Details [here](../../guides/data_input_pipeline/data_input_grain.md#using-grain).
 
 ## Gradient Accumulation Steps
 
-`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/tiling.html#gradient-accumulation).
+`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](../core_concepts/tiling.md#gradient-accumulation).
 
 For example, if `gradient_accumulation_steps` is set to `4`, the model will execute four forward and backward passes, sum the gradients, and then apply a single optimizer step. This achieves the same effective global batch size as quadrupling the `per_device_batch_size` with significantly less memory, but can potentially lead to lower MFU.
 
diff --git a/docs/reference/core_concepts/tiling.md b/docs/reference/core_concepts/tiling.md
@@ -80,4 +80,4 @@ Tiling is also crucial for managing data movement across the memory hierarchy (H
 
 **Tiling** and **sharding** are independent concepts that do not conflict; in fact, they are often used together. Sharding distributes a tensor across multiple devices, while tiling processes a tensor in chunks on the same device.
 
-To learn more about sharding in MaxText, please refer to the [sharding documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html).
+To learn more about sharding in MaxText, please refer to the [sharding documentation](../../guides/optimization/sharding.md).
diff --git a/docs/reference/models/supported_models_and_architectures.md b/docs/reference/models/supported_models_and_architectures.md
@@ -10,7 +10,7 @@ MaxText is an open-source, high-performance LLM framework written in Python/JAX.
 
 - **Supported Precisions**: FP32, BF16, INT8, and FP8.
 - **Ahead-of-Time Compilation (AOT)**: For faster model development/prototyping and earlier OOM detection.
-- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/quantization.html).
+- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](../core_concepts/quantization.md).
 - **Diagnostics**: Structured error context via **`cloud_tpu_diagnostics`** (filters stack traces to user code), simple logging via `max_logging`, profiling in **XProf**, and visualization in **TensorBoard**.
 - **Multi-Token Prediction (MTP)**: Enables token efficient training with multi-token prediction.
 - **Elastic Training**: Fault-tolerant and dynamic scale-up/scale-down on Cloud TPUs with Pathways.
diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md
@@ -36,7 +36,7 @@ Local development is a convenient way to run MaxText on a single host. It doesn'
 multiple hosts but is a good way to learn about MaxText.
 
 1. [Create and SSH to the single host VM of your choice](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm). You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`.
-2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on TPUs, install `maxtext[tpu]`.
+2. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on TPUs, install `maxtext[tpu]`.
 3. After installation completes, run training on synthetic data with the following command:
 
 ```sh
@@ -70,7 +70,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl
 
 ### Run MaxText on NVIDIA GPUs
 
-1. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on GPUs, install `maxtext[cuda12]`.
+1. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on GPUs, install `maxtext[cuda12]`.
 2. After installation is complete, run training with the following command on synthetic data:
 
 ```sh
diff --git a/docs/tutorials/inference.md b/docs/tutorials/inference.md
@@ -25,7 +25,7 @@ We support inference of MaxText models on vLLM via an [out-of-tree](https://gith
 
 # Installation
 
-Follow the instructions in [install maxtext](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) to install MaxText. For this inference tutorial on TPU (which uses vLLM), you must install `maxtext[tpu-post-train]`, as it includes the required adapter plugin. We recommend installing from PyPI to ensure you have the latest stable version of dependencies.
+Follow the instructions in [install maxtext](../install_maxtext.md) to install MaxText. For this inference tutorial on TPU (which uses vLLM), you must install `maxtext[tpu-post-train]`, as it includes the required adapter plugin. We recommend installing from PyPI to ensure you have the latest stable version of dependencies.
 
 After finishing the installation, ensure that the MaxText on vLLM adapter plugin has been installed. To do so, run the following command:
 
diff --git a/docs/tutorials/posttraining/full_finetuning.md b/docs/tutorials/posttraining/full_finetuning.md
@@ -24,7 +24,7 @@ In this tutorial we use a single host TPU VM such as `v6e-8/v5p-8`. Let's get st
 
 ## Install dependencies
 
-For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html) and use the `maxtext[tpu]` installation path to include all necessary dependencies.
+For instructions on installing MaxText on your VM, please refer to the [official documentation](../../install_maxtext.md) and use the `maxtext[tpu]` installation path to include all necessary dependencies.
 
 ## Setup environment variables
 
@@ -70,7 +70,7 @@ export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/
 
 ### Option 2: Converting a Hugging Face checkpoint
 
-Refer the steps in [Hugging Face to MaxText](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/checkpointing_solutions/convert_checkpoint.html#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
+Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solutions/convert_checkpoint.md#hugging-face-to-maxtext) to convert a hugging face checkpoint to MaxText. Make sure you have correct checkpoint files converted and saved. Similar as Option 1, you can set the following environment and move on.
 
 ```bash
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # gs://my-bucket/my-checkpoint-directory/0/items
diff --git a/docs/tutorials/posttraining/knowledge_distillation.md b/docs/tutorials/posttraining/knowledge_distillation.md
@@ -49,7 +49,7 @@ export RUN_NAME=<your-run-name> # e.g., distill-20260115
 
 To install MaxText and its dependencies for post-training (including vLLM for the teacher), run the following:
 
-1. Follow the [MaxText installation instructions](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#install-maxtext).
+1. Follow the [MaxText installation instructions](../../install_maxtext.md).
 
 2. Install the additional dependencies for post-training:
 
diff --git a/docs/tutorials/posttraining/lora.md b/docs/tutorials/posttraining/lora.md
diff --git a/docs/tutorials/posttraining/rl.md b/docs/tutorials/posttraining/rl.md
diff --git a/docs/tutorials/posttraining/rl_on_multi_host.md b/docs/tutorials/posttraining/rl_on_multi_host.md
diff --git a/docs/tutorials/posttraining/sft.md b/docs/tutorials/posttraining/sft.md
diff --git a/docs/tutorials/posttraining/sft_on_multi_host.md b/docs/tutorials/posttraining/sft_on_multi_host.md
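
Reviewer note: the `docs/conf.py` hunks above read `__version__` with a regex and then rewrite `{{version}}` in every page through a `source-read` handler. The following is a minimal standalone sketch of that flow, not the code in the PR. It runs without Sphinx, so the handler takes the version as an argument instead of `app` and `docname`. `INIT_SOURCE`, `read_version`, and the sample value `0.2.1` (borrowed from the pins this change removes) are illustrative assumptions.

```python
import re

# Hypothetical stand-in for the contents of src/maxtext/__init__.py.
INIT_SOURCE = '"""MaxText."""\n__version__ = "0.2.1"\n'

# Same pattern as in the conf.py hunk: a module-level __version__ assignment.
VERSION_RE = re.compile(r"^__version__ = ['\"]([^'\"]*)['\"]", re.MULTILINE)


def read_version(init_source: str) -> str:
  """Returns the version string, or raises if no assignment is found."""
  match = VERSION_RE.search(init_source)
  if match is None:
    raise RuntimeError("Unable to find version string.")
  return match.group(1)


def substitute_placeholders(source: list[str], version: str) -> None:
  """Edits source[0] in place, as a Sphinx `source-read` handler would."""
  source[0] = source[0].replace("{{version}}", version)


version = read_version(INIT_SOURCE)
# Sphinx passes the page text as a one-element list so handlers can mutate it.
page = ["uv pip install maxtext[tpu]=={{version}} --resolution=lowest"]
substitute_placeholders(page, version)
print(page[0])  # uv pip install maxtext[tpu]==0.2.1 --resolution=lowest
```

Because the replacement is a plain string substitution on the raw page source, it applies inside fenced code blocks too, which is presumably why the install commands can carry the placeholder directly.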
