Commit 03cabc3

Merge pull request #3671 from melissawm:fix-links
PiperOrigin-RevId: 934447338
2 parents 1432864 + 6626678 commit 03cabc3

27 files changed

Lines changed: 69 additions & 47 deletions
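
The link changes in this commit follow one pattern: relative paths to `.md` files are replaced by named labels, and a `(label)=` line is added above the heading each label points to. The following is a minimal sketch, not part of the commit, of how remaining relative `.md` links could be spotted in a docs page. It assumes the `(label)=` lines are MyST-style targets; the helper names `find_labels` and `find_relative_md_links` and the sample text are hypothetical.

```python
import re

# Inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
# Target lines of the form "(some-label)=" (assumed MyST-style syntax).
LABEL_RE = re.compile(r"^\(([A-Za-z0-9_-]+)\)=\s*$", re.MULTILINE)


def find_labels(text):
    """Return the set of labels defined by "(label)=" lines."""
    return set(LABEL_RE.findall(text))


def find_relative_md_links(text):
    """Return (link text, target) pairs whose target is a relative .md path."""
    hits = []
    for link_text, target in LINK_RE.findall(text):
        if target.startswith(("http://", "https://")):
            continue  # External URLs are left alone.
        path = target.split("#", 1)[0]  # Ignore any "#anchor" suffix.
        if path.endswith(".md"):
            hits.append((link_text, target))
    return hits


# Hypothetical sample mixing the old and new link styles.
sample = """(checkpoint-conversion)=

# Checkpoint Conversion Utilities

See the [installation documentation](../../install_maxtext.md) and
the [Checkpoints guide](checkpoints), or the
[conversion scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion).
"""

print(find_labels(sample))
print(find_relative_md_links(sample))
```

In this sample only the `../../install_maxtext.md` link is reported; the label-style link and the external URL are skipped.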

docs/guides/checkpointing_solutions/convert_checkpoint.md

Lines changed: 9 additions & 5 deletions
@@ -1,3 +1,5 @@
+(checkpoint-conversion)=
+
 # Checkpoint Conversion Utilities

 This guide provides instructions to use [checkpoint conversion scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion) to convert model checkpoints bidirectionally between Hugging Face and MaxText formats.
@@ -23,10 +25,12 @@ The following models are supported:

 ## Prerequisites

-- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](../../install_maxtext.md).
+- MaxText must be installed in a Python virtual environment using the `maxtext[tpu]` option. For instructions on installing MaxText on your VM, please refer to the official [installation documentation](install-from-source).
 - Hugging Face model checkpoints are cached locally at `$HOME/.cache/huggingface/hub` before conversion. Ensure you have sufficient disk space.
 - Authenticate via the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/v0.21.2/guides/cli) if using private or gated models.

+(hf-to-maxtext)=
+
 ## Hugging Face to MaxText

 Use the `to_maxtext.py` script to convert a Hugging Face model checkpoint into a MaxText checkpoint. The script will automatically download the specified model from the Hugging Face Hub, perform conversion, and save converted checkpoints to the given output directory.
@@ -74,7 +78,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
 ### Key Parameters

 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information. **IMPORTANT:** This setting *must* match the `scan_layers` value used during model training or loading. A mismatch will cause PyTree loading errors (though MaxText will intercept these and raise a descriptive `ValueError` explaining the mismatch).
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information. **IMPORTANT:** This setting *must* match the `scan_layers` value used during model training or loading. A mismatch will cause PyTree loading errors (though MaxText will intercept these and raise a descriptive `ValueError` explaining the mismatch).
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
@@ -126,7 +130,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \

 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
 - `load_parameters_path`: The path to the MaxText Orbax checkpoint.
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
 - `base_output_directory`: The path where the converted checkpoint will be stored; it can be Google Cloud Storage (GCS), Hugging Face Hub or local.
@@ -136,7 +140,7 @@ python3 -m maxtext.checkpoint_conversion.to_huggingface \

 To ensure the conversion was successful, you can use the [test script](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/utils/forward_pass_logit_checker.py). It runs a forward pass on both the original and converted models and compares the output logits to verify conversion. It is used to verify the bidirectional conversion.

-> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](../../install_maxtext.md#from-source).
+> **Note:** This correctness test will only work when MaxText is installed from source by following the installation instructions [here](install-from-source).

 ### Setup Environment

@@ -177,7 +181,7 @@ python3 -m tests.utils.forward_pass_logit_checker src/maxtext/configs/base.yml \

 - `load_parameters_path`: The path to the MaxText Orbax checkpoint (e.g., `gs://your-bucket/maxtext-checkpoint/0/items`).
 - `model_name`: The corresponding model name in the MaxText configuration (e.g., `qwen3-4b`).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [to the Checkpoints guide](checkpoints) for more information.
 - `use_multimodal`: Indicates if multimodality is used.
 - `--run_hf_model` (Optional): Indicates if loading Hugging Face model from the hf_model_path. If not set, it will compare the maxtext logits with pre-saved golden logits.
 - `--hf_model_path` (Optional): The path to the Hugging Face checkpoint (if `--run_hf_model=True`).

docs/guides/data_input_pipeline.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Currently MaxText has three data input pipelines:
 | **[Hugging Face](data_input_pipeline/data_input_hf.md)** | datasets in [Hugging Face Hub](https://huggingface.co/datasets)<br>local/Cloud Storage datasets in json, parquet, arrow, csv, txt (sequential access) | no download needed, convenience; <br>multiple formats | limit scalability using the Hugging Face Hub (no limit using Cloud Storage); <br>non-deterministic with preemption<br>(deterministic without preemption)<br> |
 | **[TFDS](data_input_pipeline/data_input_tfds.md)** | TFRecord (sequential access), available through [Tensorflow Datasets](https://www.tensorflow.org/datasets/catalog/overview) | performant | only supports TFRecords; <br>non-deterministic with preemption<br>(deterministic without preemption) |

+(multihost-dataloading-best-practice)=
+
 ## Multihost dataloading best practice

 Training in a multi-host environment presents unique challenges for data input pipelines. An effective data loading strategy must address three key issues:

docs/guides/data_input_pipeline/data_input_grain.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,5 @@
+(grain-pipeline)=
+
 # Grain pipeline

 ## The recommended input pipeline for determinism and resilience!
@@ -30,6 +32,8 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 - **Global shuffle**: This feature is only available when using Grain with [ArrayRecord](https://github.com/google/array_record) (random access) format, achieved by shuffling indices globally at the beginning of each epoch and then reading the elements according to the random order. This shuffle method effectively prevents local overfitting, leading to better training results.
 - **Hierarchical shuffle**: For sequential access format [Parquet](https://arrow.apache.org/docs/python/parquet.html), shuffle is performed by these steps: file shuffling, interleave from files, and window shuffle using a fixed size buffer.

+(using-grain)=
+
 ## Using Grain

 1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.

docs/guides/model_bringup.md

Lines changed: 6 additions & 6 deletions
@@ -20,15 +20,15 @@ This documentation acts as the primary resource for efficiently integrating new

 ## 1. Architecture Analysis

-The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](../reference/architecture/architecture_overview.md) and [list of supported models](../reference/models/supported_models_and_architectures.md).
+The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](architecture-overview) and [list of supported models](supported-models).

-**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](data_input_pipeline.md)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
+**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](data-input-pipeline)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).

 **Tokenizer**: Supported [tokenizer options](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/tokenizer.py) include `TikTokenTokenizer`, `SentencePieceTokenizer`, and `HFTokenizer`.

 **Self-Attention & RoPE**: Available mechanisms include optimized [Flash Attention](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/layers/attention_op.py#L1184) (supporting MHA, GQA, and MQA), Multi-head Latent Attention ([MLA](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/attention_mla.py)), and [Gated Delta Network](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/models/qwen3.py#L358). MaxText also supports [Regular](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L108), [Llama](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L178), and [YaRN](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L282) variations of Rotary Positional Embeddings (RoPE).

-**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](../reference/core_concepts/moe_configuration.md) for routed and shared experts.
+**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](moe-configuration) for routed and shared experts.

 **Normalization**: We support different [normalization strategies](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/normalizations.py), including RMSNorm and Gated RMSNorm. These can be configured before or after attention/MLP layers.

@@ -44,7 +44,7 @@ This step can be bypassed if the current MaxText codebase already supports all c

 While most open-source models are distributed in Safetensors or PyTorch formats, MaxText requires conversion to the [Orbax](https://orbax.readthedocs.io/en/latest/) format.

-There are [two primary formats](../reference/core_concepts/checkpoints.md) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
+There are [two primary formats](checkpoints) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:

 - **Scanned Format**: Recommended for **training** as it stacks layers for efficient processing via `jax.lax.scan`. To enable this, set `scan_layers=True`.
 - **Unscanned Format**: Recommended for **inference** to simplify loading individual layer parameters. To enable this, set `scan_layers=False`.
@@ -58,7 +58,7 @@ Success starts with a clear map. You must align the parameter names from your so

 ### 3.2 Write Script

-Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommended to use the [checkpoint conversion utility](checkpointing_solutions/convert_checkpoint.md) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
+Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommend using the [checkpoint conversion utility](checkpoint-conversion) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).

 ### 3.3 Verify Compatibility

@@ -132,7 +132,7 @@ If you run the `forward_pass_logit_checker.py` to compare reference logits with

 **Q: How to compile models for a target hardware without physical access?**

-**A:** If you need to compile your training run ahead of time, use the train_compile.py tool. This utility allows you to compile the primary train_step for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](monitoring_and_debugging/features_and_diagnostics.md#ahead-of-time-compilation-aot) for more examples.
+**A:** If you need to compile your training run ahead of time, use the `train_compile.py` tool. This utility allows you to compile the primary `train_step` for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](aot-compilation) for more examples.

 **Q: My model is too large for my development machine. What should I do?**

docs/guides/optimization/custom_model.md

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ Ironwood over ICI:
 - `3 * M * 8 / 2 > 12800`
 - `M > 1100`

-It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](../optimization/sharding.md) for specific challenges regarding PP + FSDP/DP.
+It is important to emphasize that this is a theoretical roofline analysis. Real-world performance will depend on the efficiency of the implementation and XLA compilation on the TPU. Refer to the [link](sharding_on_TPUs) for specific challenges regarding PP + FSDP/DP.

 ## Step 4. Analyze experiments

docs/guides/run_python_notebook.md

Lines changed: 2 additions & 2 deletions
@@ -86,7 +86,7 @@ To install, click the `Extensions` icon on the left sidebar (or press `Ctrl+Shif

 ### Step 3: Install MaxText and Dependencies

-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](../install_maxtext.md#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](install-from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.

 > **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
@@ -139,7 +139,7 @@ pip3 install jupyterlab

 ### Step 3: Install MaxText and Dependencies

-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](../install_maxtext.md#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](install-from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.

 > **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.

docs/install_maxtext.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ This is the easiest way to get started with the latest stable version.
 access to the `build_maxtext_docker_image`, `upload_maxtext_docker_image`,
 and `xpk` commands. For more details on building and uploading Docker
 images, see the
-[Build MaxText Docker Image](build_maxtext.md)
+[Build MaxText Docker Image](build-docker)
 guide.

```bash

docs/reference/architecture/architecture_overview.md

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+(architecture-overview)=
+
 # Architecture overview

 ## The MaxText philosophy

docs/reference/architecture/jax_ai_libraries_chosen.md

Lines changed: 2 additions & 2 deletions
@@ -56,11 +56,11 @@ For more information on using Orbax, please refer to https://github.com/google/o

 1. **Deterministic by Design**: Grain allows storing data loader states, provides strong guarantees about data ordering and sharding even with preemptions, which is critical for reproducibility.
 2. **Global Shuffle**: Prevents local overfitting.
-3. **Built for Multi-Host Training**: The using random access file format streamlines [data loading in the multi-host environments](../../guides/data_input_pipeline.md#multihost-dataloading-best-practice).
+3. **Built for Multi-Host Training**: The using random access file format streamlines [data loading in the multi-host environments](multihost-dataloading-best-practice).

 Its APIs are explicitly designed for the multi-host paradigm, simplifying the process of ensuring that each host loads a unique shard of the global batch.

-For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located [here](../../guides/data_input_pipeline/data_input_grain.md).
+For more information on using Grain, please refer to https://github.com/google/grain and the [Grain guide in MaxText](grain-pipeline).

 ## Qwix: For native JAX quantization
