> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
@@ -78,25 +78,7 @@ If you plan to contribute to MaxText or need the latest unreleased features, ins
docs/guides/checkpointing_solutions/convert_checkpoint.md: 5 additions & 5 deletions
@@ -23,7 +23,7 @@ The following models are supported:
 ## Prerequisites

-- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html).
+- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](../../install_maxtext.md).
 - Hugging Face model checkpoints are cached locally at `$HOME/.cache/huggingface/hub` before conversion. Ensure you have sufficient disk space.
 - Authenticate via the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/v0.21.2/guides/cli) if using private or gated models.
@@ -71,7 +71,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
 ### Key Parameters

 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
 - `load_parameters_path`: The path to the MaxText Orbax checkpoint.
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
 - `base_output_directory`: The path where the converted checkpoint will be stored; it can be Google Cloud Storage (GCS), Hugging Face Hub or local.
 To ensure the conversion was successful, you can use the [test script](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py). It runs a forward pass on both the original and converted models and compares the output logits to verify conversion. It is used to verify the bidirectional conversion.

-> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/install_maxtext.html#from-source).
+> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](../../install_maxtext.md#from-source).
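
The hunk above describes verifying a conversion by comparing output logits from the original and converted models. As a rough illustration of that idea only (this is not the logic of `forward_pass_logit_checker.py`; the function names and the tolerance value are hypothetical), a logit comparison could be sketched in plain Python as:

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def logits_match(reference, candidate, tolerance=1e-2):
    """Hypothetical pass/fail rule: every logit agrees within `tolerance`."""
    return max_abs_diff(reference, candidate) <= tolerance


# Made-up logits standing in for the original and converted model outputs.
original_logits = [2.0, -1.0, 0.5]
converted_logits = [2.001, -1.0005, 0.4995]
print(logits_match(original_logits, converted_logits))
```

The real script may use different metrics and thresholds; consult its source for the actual checks.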
 - `load_parameters_path`: The path to the MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
 - `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/checkpoints.html) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
 - `use_multimodal`: Indicates if multimodality is used.
 - `--run_hf_model` (Optional): Indicates if loading Hugging Face model from the hf_model_path. If not set, it will compare the maxtext logits with pre-saved golden logits.
 - `--hf_model_path` (Optional): The path to the Hugging Face checkpoint (if `--run_hf_model=True`).
docs/guides/optimization/custom_model.md: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ Ironwood over ICI:
 - `3 * M * 8 / 2 > 12800`
 - `M > 1100`

-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](../optimization/sharding.md) for specific challenges regarding PP + FSDP/DP.
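
The inequality in the context lines of this hunk can be checked numerically. The sketch below only rearranges `3 * M * 8 / 2 > 12800` for `M`, using the constants exactly as they appear above; the function name is illustrative and the meaning of each constant is defined in the surrounding guide, not here.

```python
def min_m(threshold: float = 12800.0) -> float:
    """Solve 3 * M * 8 / 2 > threshold for M and return the lower bound."""
    return threshold * 2 / (3 * 8)


bound = min_m()
# The exact bound is about 1066.7, which the guide rounds to roughly 1100.
print(round(bound, 1))
```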
docs/reference/architecture/jax_ai_libraries_chosen.md: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ For more information on using Orbax, please refer to https://github.com/google/o
 Its APIs are explicitly designed for the multi-host paradigm, simplifying the process of ensuring that each host loads a unique shard of the global batch.

-For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located at https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline/data_input_grain.html
+For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located [here](../../guides/data_input_pipeline/data_input_grain.md).
docs/reference/core_concepts/batch_size.md: 2 additions & 2 deletions
@@ -34,11 +34,11 @@ You can set `per_device_batch_size` and `gradient_accumulation_steps` in `config
 `global_batch_to_load` = `global_batch_size_to_train_on x expansion_factor_real_data`

-When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job. Details in https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/data_input_pipeline/data_input_grain.html#using-grain.
+When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When set to between 0 and 1, it's for grain pipeline to use a smaller chip count to read checkpoint from a larger chip count job. Details [here](../../guides/data_input_pipeline/data_input_grain.md#using-grain).

 ## Gradient Accumulation Steps

-`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/tiling.html#gradient-accumulation).
+`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](../core_concepts/tiling.md#gradient-accumulation).

 For example, if `gradient_accumulation_steps` is set to `4`, the model will execute four forward and backward passes, sum the gradients, and then apply a single optimizer step. This achieves the same effective global batch size as quadrupling the `per_device_batch_size` with significantly less memory, but can potentially lead to lower MFU.
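
The accumulation behaviour described in this hunk can be illustrated with a toy example. This is a minimal sketch in plain Python, assuming a one-parameter linear model with a summed squared-error loss; none of the names below are MaxText code, and the data and learning rate are made up.

```python
def grad(w, batch):
    """Gradient of sum(0.5 * (w * x - y) ** 2) over the batch, with respect to w."""
    return sum((w * x - y) * x for x, y in batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5
gradient_accumulation_steps = 4  # mirrors the setting discussed above

# Split the batch into one micro-batch per accumulation step.
micro_batch_size = len(data) // gradient_accumulation_steps
micro_batches = [
    data[i : i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
]

# Four forward/backward passes, summing the gradients as we go.
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += grad(w, micro_batch)

# A single pass over the full batch gives the same summed gradient.
full_batch = grad(w, data)

# One optimizer step (plain SGD here) is applied after accumulation.
learning_rate = 0.01
w_updated = w - learning_rate * accumulated
print(accumulated, full_batch, w_updated)
```

Only one micro-batch needs to be resident at a time, which is where the memory saving comes from; the passes run sequentially, which is one reason utilization can drop.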
docs/reference/core_concepts/tiling.md: 1 addition & 1 deletion
@@ -80,4 +80,4 @@ Tiling is also crucial for managing data movement across the memory hierarchy (H
 **Tiling** and **sharding** are independent concepts that do not conflict; in fact, they are often used together. Sharding distributes a tensor across multiple devices, while tiling processes a tensor in chunks on the same device.

-To learn more about sharding in MaxText, please refer to the [sharding documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/optimization/sharding.html).
+To learn more about sharding in MaxText, please refer to the [sharding documentation](../../guides/optimization/sharding.md).
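
The distinction drawn in this hunk between sharding and tiling can be sketched with ordinary lists. This is a toy illustration only: "devices" are simulated as list slices, the function names are hypothetical, and nothing here reflects how MaxText or XLA implement either technique.

```python
def shard(values, num_devices):
    """Sharding: split one tensor (here, a list) across several devices."""
    size = len(values) // num_devices
    return [values[i * size : (i + 1) * size] for i in range(num_devices)]


def tiled_sum_of_squares(local_values, tile_size):
    """Tiling: process one device's data in fixed-size chunks."""
    total = 0
    for start in range(0, len(local_values), tile_size):
        tile = local_values[start : start + tile_size]
        total += sum(v * v for v in tile)
    return total


values = list(range(16))
shards = shard(values, num_devices=4)  # 4 simulated devices, 4 values each
per_device = [tiled_sum_of_squares(s, tile_size=2) for s in shards]
result = sum(per_device)  # combine the per-device partial results
print(result)
```

Both steps apply to the same computation without interfering: the shard count decides which device holds which slice, and the tile size decides how each device walks through its own slice.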
docs/reference/models/supported_models_and_architectures.md: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ MaxText is an open-source, high-performance LLM framework written in Python/JAX.
 - **Supported Precisions**: FP32, BF16, INT8, and FP8.
 - **Ahead-of-Time Compilation (AOT)**: For faster model development/prototyping and earlier OOM detection.
-- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/reference/core_concepts/quantization.html).
+- **Quantization**: Via **Qwix** (recommended) and AQT. See Quantization [Guide](../reference/core_concepts/quantization.md).
 - **Diagnostics**: Structured error context via **`cloud_tpu_diagnostics`** (filters stack traces to user code), simple logging via `max_logging`, profiling in **XProf**, and visualization in **TensorBoard**.
 - **Multi-Token Prediction (MTP)**: Enables token efficient training with multi-token prediction.
 - **Elastic Training**: Fault-tolerant and dynamic scale-up/scale-down on Cloud TPUs with Pathways.
docs/tutorials/first_run.md: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ Local development is a convenient way to run MaxText on a single host. It doesn'
 multiple hosts but is a good way to learn about MaxText.

 1. [Create and SSH to the single host VM of your choice](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm). You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`.
-2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on TPUs, install `maxtext[tpu]`.
+2. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on TPUs, install `maxtext[tpu]`.
 3. After installation completes, run training on synthetic data with the following command:

```sh
@@ -70,7 +70,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl
 ### Run MaxText on NVIDIA GPUs

-1. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For this tutorial on GPUs, install `maxtext[cuda12]`.
+1. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on GPUs, install `maxtext[cuda12]`.
 2. After installation is complete, run training with the following command on synthetic data: