diff --git a/README.md b/README.md
index ef8ccb4ddf..49329ef2a2 100644
--- a/README.md
+++ b/README.md
@@ -28,54 +28,56 @@ MaxText achieves high Model FLOPs Utilization (MFU) and tokens/second from singl
MaxText is the launching point for ambitious LLM projects both in research and production. We encourage you to start by experimenting with MaxText out of the box and then fork and modify MaxText to meet your needs.
-Check out our [Read The Docs site](https://maxtext.readthedocs.io/en/latest/) or directly [Get Started](https://maxtext.readthedocs.io/en/latest/tutorials/first_run.html) with your first MaxText run. If you’re interested in Diffusion models (Wan 2.1, Flux, etc), see the [MaxDiffusion](https://github.com/AI-Hypercomputer/maxdiffusion) repository in our AI Hypercomputer GitHub organization.
+Check out our [Read The Docs site](https://maxtext.readthedocs.io/en/latest/) or directly [Get Started](https://maxtext.readthedocs.io/en/latest/getting_started.html) with your first MaxText run. If you’re interested in Diffusion models (Wan 2.1, Flux, etc), see the [MaxDiffusion](https://github.com/AI-Hypercomputer/maxdiffusion) repository in our AI Hypercomputer GitHub organization.
## Installation
See our installation guide to [install MaxText with pip from PyPI](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-pypi-recommended).
## Decoupled mode
+
See our guide on running MaxText in decoupled mode, without any GCP dependencies in [Decoupled Mode Guide](https://maxtext.readthedocs.io/en/latest/run_maxtext/decoupled_mode.html).
## 🔥 Latest news 🔥
-* \[May 29, 2026\] A new eval framework is now available for running lm-eval, evalchemy, and custom benchmarks against MaxText checkpoints. See the [Evaluation guide](https://maxtext.readthedocs.io/en/latest/guides/eval_framework.html) to get started.
-* \[May 6, 2026\] Kimi-K2-Thinking, Kimi-K2.5 (text), and Kimi-K2.6 (text) are now supported. See [Run_Kimi.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/end_to_end/tpu/kimi/Run_Kimi.md#quantized-variants-k2-thinking-k25-k26) for details.
-* \[April 18, 2026\] Added a new notebook [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/3c7d8d27864fc12cccac07786f02bd0e5262c982/src/maxtext/examples/maxtext_with_gepa.ipynb) for optimizing AIME prompts using the GEPA framework with Maxtext.
-* \[April 14, 2026\] Legacy `MaxText.*` post-training shims have been removed. Please refer to [src/MaxText/README.md](https://github.com/AI-Hypercomputer/maxtext/blob/0536605a8ca116087ed93178433a67e905be566c/src/MaxText/README.md) for details on the new command locations and how to migrate.
-* \[April 13, 2026\] Kimi-K2 is now supported, along with MuonClip optimizer. Try the [kimi-k2-1t](https://github.com/AI-Hypercomputer/maxtext/blob/fa5b5ebf9a8e4f7a33bd88eae051dc21f3147791/src/maxtext/configs/models/kimi-k2-1t.yml) config and check the [user guide](https://github.com/AI-Hypercomputer/maxtext/blob/fa5b5ebf9a8e4f7a33bd88eae051dc21f3147791/tests/end_to_end/tpu/kimi/Run_Kimi.md).
-* \[April 10, 2026\] [DeepSeek-V3.2](https://arxiv.org/pdf/2512.02556) is now supported, featuring DeepSeek Sparse Attention for long context. Try it out with the [deepseek3.2-671b](https://github.com/AI-Hypercomputer/maxtext/blob/20d93f62a91899dbbb8f23562973d75104411d3a/src/maxtext/configs/models/deepseek3.2-671b.yml) config. See the [user guide](https://github.com/AI-Hypercomputer/maxtext/blob/20d93f62a91899dbbb8f23562973d75104411d3a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md) for more details.
-* \[April 2, 2026\] Gemma 4 multi-modal models (26B MoE, 31B dense) are now supported! Try them out with our [gemma4-26b](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/src/maxtext/configs/models/gemma4-26b.yml) and [gemma4-31b](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/src/maxtext/configs/models/gemma4-31b.yml) configs. For more details, see [Run_Gemma4.md](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/tests/end_to_end/tpu/gemma4/Run_Gemma4.md).
-* \[March 6, 2026\] New features from DeepSeek-AI are now supported: Conditional Memory via Scalable Lookup ([Engram](https://arxiv.org/abs/2601.07372)) and Manifold-Constrained Hyper-Connections ([mHC](https://arxiv.org/abs/2512.24880)). Try them out with our [deepseek-custom](https://github.com/AI-Hypercomputer/maxtext/blob/92f55e3d8123e225c2714a9bac40dd0a7f89efb0/src/maxtext/configs/models/deepseek-custom.yml) starter config.
-* \[March 5, 2026\] New `tpu-post-train` [target in PyPI](https://pypi.org/project/maxtext). Please also use this installation option for running vllm_decode. See the [MaxText installation instructions](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) for more info.
-* \[March 5, 2026\] [Qwen3-Next](https://github.com/AI-Hypercomputer/maxtext/blob/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/tests/end_to_end/tpu/qwen/next/run_qwen3_next.md) is now supported.
-* \[February 27, 2026\] New MaxText structure! MaxText has been restructured according to [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/1b9e38aa0a19b6018feb3aed757406126b6953a1/RESTRUCTURE.md). Please feel free to share your thoughts and feedback.
-* \[December 22, 2025\] [Muon optimizer](https://kellerjordan.github.io/posts/muon) is now supported.
-* \[December 10, 2025\] DeepSeek V3.1 is now supported. Use existing configs for [DeepSeek V3 671B](https://github.com/AI-Hypercomputer/maxtext/blob/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/src/maxtext/configs/models/deepseek3-671b.yml) and load in V3.1 checkpoint to use model.
+- [May 29, 2026] A new eval framework is now available for running lm-eval, evalchemy, and custom benchmarks against MaxText checkpoints. See the [Evaluation guide](https://maxtext.readthedocs.io/en/latest/guides/eval_framework.html) to get started.
+- [May 6, 2026] Kimi-K2-Thinking, Kimi-K2.5 (text), and Kimi-K2.6 (text) are now supported. See [Run_Kimi.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/tests/end_to_end/tpu/kimi/Run_Kimi.md#quantized-variants-k2-thinking-k25-k26) for details.
+- [April 18, 2026] Added a new notebook [maxtext_with_gepa.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/3c7d8d27864fc12cccac07786f02bd0e5262c982/src/maxtext/examples/maxtext_with_gepa.ipynb) for optimizing AIME prompts using the GEPA framework with MaxText.
+- [April 14, 2026] Legacy `MaxText.*` post-training shims have been removed. Please refer to [src/MaxText/README.md](https://github.com/AI-Hypercomputer/maxtext/blob/0536605a8ca116087ed93178433a67e905be566c/src/MaxText/README.md) for details on the new command locations and how to migrate.
+- [April 13, 2026] Kimi-K2 is now supported, along with MuonClip optimizer. Try the [kimi-k2-1t](https://github.com/AI-Hypercomputer/maxtext/blob/fa5b5ebf9a8e4f7a33bd88eae051dc21f3147791/src/maxtext/configs/models/kimi-k2-1t.yml) config and check the [user guide](https://github.com/AI-Hypercomputer/maxtext/blob/fa5b5ebf9a8e4f7a33bd88eae051dc21f3147791/tests/end_to_end/tpu/kimi/Run_Kimi.md).
+- [April 10, 2026] [DeepSeek-V3.2](https://arxiv.org/pdf/2512.02556) is now supported, featuring DeepSeek Sparse Attention for long context. Try it out with the [deepseek3.2-671b](https://github.com/AI-Hypercomputer/maxtext/blob/20d93f62a91899dbbb8f23562973d75104411d3a/src/maxtext/configs/models/deepseek3.2-671b.yml) config. See the [user guide](https://github.com/AI-Hypercomputer/maxtext/blob/20d93f62a91899dbbb8f23562973d75104411d3a/tests/end_to_end/tpu/deepseek/Run_DeepSeek.md) for more details.
+- [April 2, 2026] Gemma 4 multi-modal models (26B MoE, 31B dense) are now supported! Try them out with our [gemma4-26b](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/src/maxtext/configs/models/gemma4-26b.yml) and [gemma4-31b](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/src/maxtext/configs/models/gemma4-31b.yml) configs. For more details, see [Run_Gemma4.md](https://github.com/AI-Hypercomputer/maxtext/blob/cdc587f0935a5e2d6f8287b96669cf2e87a0acdc/tests/end_to_end/tpu/gemma4/Run_Gemma4.md).
+- [March 6, 2026] New features from DeepSeek-AI are now supported: Conditional Memory via Scalable Lookup ([Engram](https://arxiv.org/abs/2601.07372)) and Manifold-Constrained Hyper-Connections ([mHC](https://arxiv.org/abs/2512.24880)). Try them out with our [deepseek-custom](https://github.com/AI-Hypercomputer/maxtext/blob/92f55e3d8123e225c2714a9bac40dd0a7f89efb0/src/maxtext/configs/models/deepseek-custom.yml) starter config.
+- [March 5, 2026] New `tpu-post-train` [target in PyPI](https://pypi.org/project/maxtext). Please also use this installation option for running vllm_decode. See the [MaxText installation instructions](https://maxtext.readthedocs.io/en/latest/install_maxtext.html) for more info.
+- [March 5, 2026] [Qwen3-Next](https://github.com/AI-Hypercomputer/maxtext/blob/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/tests/end_to_end/tpu/qwen/next/run_qwen3_next.md) is now supported.
+- [February 27, 2026] New MaxText structure! MaxText has been restructured according to [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/1b9e38aa0a19b6018feb3aed757406126b6953a1/RESTRUCTURE.md). Please feel free to share your thoughts and feedback.
+- [December 22, 2025] [Muon optimizer](https://kellerjordan.github.io/posts/muon) is now supported.
+- [December 10, 2025] DeepSeek V3.1 is now supported. Use the existing [DeepSeek V3 671B](https://github.com/AI-Hypercomputer/maxtext/blob/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/src/maxtext/configs/models/deepseek3-671b.yml) config and load a V3.1 checkpoint to use the model.
View full news archive
-* \[December 9, 2025\] [New RL and SFT Notebook tutorials](https://github.com/AI-Hypercomputer/maxtext/tree/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/src/maxtext/examples) are available.
-* \[December 4, 2025\] The [ReadTheDocs documentation site](https://maxtext.readthedocs.io/en/latest/index.html) has been reorganized.
-* \[December 3, 2025\] Multi-host support for GSPO and GRPO is now available via [new RL tutorials](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
-* \[November 20, 2025\] A new guide, [What is Post Training in MaxText?](https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html), is now available.
-* \[November 6, 2025\] Ironwood TPU co-designed AI stack announced. Read the [blog post on its co-design with MaxText](https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack?e=48754805).
-* \[October 29, 2025\] [Optimized models tiering documentation](https://maxtext.readthedocs.io/en/latest/reference/models/tiering.html) has been refreshed.
-* \[October 12, 2025\] Added Versioning. Check out our [first set of release notes](https://maxtext.readthedocs.io/en/latest/release_notes.html)!
-* \[October 10, 2025\] Post-Training (SFT, RL) via [Tunix](https://github.com/google/tunix) is now available.
-* \[September 26, 2025\] Vocabulary tiling ([PR](https://github.com/AI-Hypercomputer/maxtext/pull/2242)) is now supported in MaxText! Adjust config `num_vocab_tiling` to unlock more efficient memory usage.
-* \[September 24, 2025\] The GPT-OSS family of models (20B, 120B) is now supported.
-* \[September 15, 2025\] MaxText is now available as a [PyPI package](https://pypi.org/project/maxtext). Users can now [install maxtext through pip](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
-* \[September 5, 2025\] MaxText has moved to an `src` layout as part of [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/aca5b24931ebcbadb55a82e56ebffe8024874028/RESTRUCTURE.md). For existing environments, please run `pip install -e .` from MaxText root.
-* \[August 13, 2025\] The Qwen3 2507 MoE family of models is now supported: MoEs: 235B Thinking & 480B Coder as well as existing dense models: 0.6B, 4B, 8B, 14B, and 32B.
-* \[July 27, 2025\] Updated TFLOPS/s calculation ([PR](https://github.com/AI-Hypercomputer/maxtext/pull/1988)) to account for causal attention, dividing the attention flops in half. Accounted for sliding window and chunked attention reduced attention flops in [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2009) and [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2030). Changes impact large sequence configs, as explained in this [doc](https://maxtext.readthedocs.io/en/latest/reference/performance_metrics.html)
-* \[July 16, 2025\] We will be restructuring the MaxText repository for improved organization and clarity. Please review the [proposed structure](https://github.com/AI-Hypercomputer/maxtext/blob/aca5b24931ebcbadb55a82e56ebffe8024874028/RESTRUCTURE.md) and provide feedback.
-* \[July 11, 2025\] Multi-Token Prediction (MTP) training support\! Adds an auxiliary loss based on predicting multiple future tokens, inspired by [DeepSeek-V3 paper](https://arxiv.org/html/2412.19437v1), to enhance training efficiency.
-* \[June 25, 2025\] DeepSeek R1-0528 variant is now supported.
-* \[April 24, 2025\] Llama 4 Maverick models are now supported.
+- [December 9, 2025] [New RL and SFT Notebook tutorials](https://github.com/AI-Hypercomputer/maxtext/tree/7656eb8d1c9eb0dd91e617a6fdf6ad805221221a/src/maxtext/examples) are available.
+- [December 4, 2025] The [ReadTheDocs documentation site](https://maxtext.readthedocs.io/en/latest/index.html) has been reorganized.
+- [December 3, 2025] Multi-host support for GSPO and GRPO is now available via [new RL tutorials](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
+- [November 20, 2025] A new guide, [What is Post Training in MaxText?](https://maxtext.readthedocs.io/en/latest/tutorials/post_training_index.html), is now available.
+- [November 6, 2025] Ironwood TPU co-designed AI stack announced. Read the [blog post on its co-design with MaxText](https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack?e=48754805).
+- [October 29, 2025] [Optimized models tiering documentation](https://maxtext.readthedocs.io/en/latest/reference/models/tiering.html) has been refreshed.
+- [October 12, 2025] Added Versioning. Check out our [first set of release notes](https://maxtext.readthedocs.io/en/latest/release_notes.html)!
+- [October 10, 2025] Post-Training (SFT, RL) via [Tunix](https://github.com/google/tunix) is now available.
+- [September 26, 2025] Vocabulary tiling ([PR](https://github.com/AI-Hypercomputer/maxtext/pull/2242)) is now supported in MaxText! Adjust config `num_vocab_tiling` to unlock more efficient memory usage.
+- [September 24, 2025] The GPT-OSS family of models (20B, 120B) is now supported.
+- [September 15, 2025] MaxText is now available as a [PyPI package](https://pypi.org/project/maxtext). Users can now [install maxtext through pip](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).
+- [September 5, 2025] MaxText has moved to an `src` layout as part of [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/aca5b24931ebcbadb55a82e56ebffe8024874028/RESTRUCTURE.md). For existing environments, please run `pip install -e .` from MaxText root.
+- [August 13, 2025] The Qwen3 2507 MoE family of models is now supported: 235B Thinking and 480B Coder, alongside the existing dense models (0.6B, 4B, 8B, 14B, and 32B).
+- [July 27, 2025] Updated the TFLOP/s calculation ([PR](https://github.com/AI-Hypercomputer/maxtext/pull/1988)) to account for causal attention, halving the attention FLOPs. Sliding-window and chunked attention, which further reduce attention FLOPs, are accounted for in [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2009) and [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2030). These changes affect long-sequence configs, as explained in this [doc](https://maxtext.readthedocs.io/en/latest/reference/performance_metrics.html).
+- [July 16, 2025] We will be restructuring the MaxText repository for improved organization and clarity. Please review the [proposed structure](https://github.com/AI-Hypercomputer/maxtext/blob/aca5b24931ebcbadb55a82e56ebffe8024874028/RESTRUCTURE.md) and provide feedback.
+- [July 11, 2025] Multi-Token Prediction (MTP) training is now supported! It adds an auxiliary loss based on predicting multiple future tokens, inspired by the [DeepSeek-V3 paper](https://arxiv.org/html/2412.19437v1), to enhance training efficiency.
+- [June 25, 2025] DeepSeek R1-0528 variant is now supported.
+- [April 24, 2025] Llama 4 Maverick models are now supported.
+
@@ -89,7 +91,7 @@ In addition to pure text-based LLMs, we also support multi-modal training with G
### Pre-training
-If you’re building models from scratch, MaxText can serve as a reference implementation for experimentation, ideation, and inspiration \- just fork and modify MaxText to train your model, whether it’s a small dense model like Llama 8B, or a large MoE like DeepSeek-V3. Experiment with configs and model design to build the most efficient model on TPU or GPU.
+If you’re building models from scratch, MaxText can serve as a reference implementation for experimentation, ideation, and inspiration: just fork and modify MaxText to train your model, whether it’s a small dense model like Llama 8B or a large MoE like DeepSeek-V3. Experiment with configs and model design to build the most efficient model on TPU or GPU.
MaxText provides opinionated implementations for how to achieve optimal performance across a wide variety of dimensions like sharding, quantization, and checkpointing.
@@ -97,16 +99,16 @@ MaxText provides opinionated implementations for how to achieve optimal performa
If you are post-training a model, whether it is proprietary or open source, MaxText provides a scalable framework using Tunix. For RL (like GRPO), we leverage vLLM for sampling and Pathways (soon) for multi-host.
-Our goal is to provide a variety of models (dimension “a”) and techniques (dimension “b”), so you can easily explore (a) \* (b) combinations and efficiently train the perfect model for your use case.
+Our goal is to provide a variety of models (dimension “a”) and techniques (dimension “b”), so you can easily explore (a) * (b) combinations and efficiently train the perfect model for your use case.
Check out these getting started guides:
-* Supervised Fine Tuning (SFT)
- * [SFT on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft.html)
- * [SFT on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html)
-* Reinforcement Learning (RL)
- * [RL on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html)
- * [RL on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html)
+- Supervised Fine Tuning (SFT)
+ - [SFT on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft.html)
+ - [SFT on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html)
+- Reinforcement Learning (RL)
+ - [RL on Single-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl.html)
+ - [RL on Multi-Host TPUs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html)
### Model library
@@ -114,36 +116,36 @@ MaxText aims to provide you with the best OSS models, whether as a reference imp
**Supported JAX models in MaxText**
-* Google
- * Gemma 4 (26B MoE, 31B Dense)
- * Gemma 3 (4B, 12B, 27B)
- * Gemma 2 (2B, 9B, 27B)
- * Gemma 1 (2B, 7B)
-* Alibaba
- * Qwen 3 Next (80B)
- * Qwen 3 MoE (30B, 235B), Qwen 3 MoE 2507 (235B, 480B)
- * Qwen 3 Dense (0.6B, 1.7B, 4B, 8B, 14B, 32B)
- * Qwen 2.5 (1.5B, 7B, 14B)
-* DeepSeek AI
- * DeepSeek V3.2 (671B)
- * DeepSeek V3.1 (671B)
- * DeepSeek V3 0324 (671B), DeepSeek R1 0528 (671B)
- * DeepSeek V2 (16B, 236B)
-* Moonshot AI
- * Kimi K2 (1T), Kimi K2-Thinking (1T)
- * Kimi K2.5 (text), Kimi K2.6 (text)
-* Meta
- * Llama 4 Scout (109B) & Maverick (400B)
- * Llama 3.3 (70B), 3.1 (8B, 70B, 405B), 3.0 (8B, 70B, 405B)
- * Llama 2 (7B, 13B, 70B)
-* OpenAI
- * GPT-OSS (20B, 120B)
- * GPT3 (52K, 6B, 22B, 175B)
-* Mistral AI
- * Mixtral (8x7B, 8x22B)
- * Mistral (7B)
-* Diffusion Models
- * See [MaxDiffusion](https://github.com/AI-Hypercomputer/maxdiffusion) (LTXV, Wan 2.1, Flux, SDXL, etc)
+- Google
+ - Gemma 4 (26B MoE, 31B Dense)
+ - Gemma 3 (4B, 12B, 27B)
+ - Gemma 2 (2B, 9B, 27B)
+ - Gemma 1 (2B, 7B)
+- Alibaba
+ - Qwen 3 Next (80B)
+ - Qwen 3 MoE (30B, 235B), Qwen 3 MoE 2507 (235B, 480B)
+ - Qwen 3 Dense (0.6B, 1.7B, 4B, 8B, 14B, 32B)
+ - Qwen 2.5 (1.5B, 7B, 14B)
+- DeepSeek AI
+ - DeepSeek V3.2 (671B)
+ - DeepSeek V3.1 (671B)
+ - DeepSeek V3 0324 (671B), DeepSeek R1 0528 (671B)
+ - DeepSeek V2 (16B, 236B)
+- Moonshot AI
+ - Kimi K2 (1T), Kimi K2-Thinking (1T)
+ - Kimi K2.5 (text), Kimi K2.6 (text)
+- Meta
+ - Llama 4 Scout (109B) & Maverick (400B)
+ - Llama 3.3 (70B), 3.1 (8B, 70B, 405B), 3.0 (8B, 70B, 405B)
+ - Llama 2 (7B, 13B, 70B)
+- OpenAI
+ - GPT-OSS (20B, 120B)
+ - GPT3 (52K, 6B, 22B, 175B)
+- Mistral AI
+ - Mixtral (8x7B, 8x22B)
+ - Mistral (7B)
+- Diffusion Models
+ - See [MaxDiffusion](https://github.com/AI-Hypercomputer/maxdiffusion) (LTXV, Wan 2.1, Flux, SDXL, etc)
## Get involved
diff --git a/docs/getting_started.md b/docs/getting_started.md
new file mode 100644
index 0000000000..46a7c27402
--- /dev/null
+++ b/docs/getting_started.md
@@ -0,0 +1,65 @@
+
+
+(getting-started)=
+
+# Getting Started
+
+Welcome to MaxText! This guide walks you through your first MaxText workloads, whether on a single host or in a multihost environment, using Cloud TPUs or NVIDIA GPUs. Follow the steps below to install MaxText, train your first model, and run inference.
+
+## Prerequisites
+
+1. To store logs and checkpoints, [create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) in your project. To run MaxText, the TPU or GPU VMs must have read/write permissions for the bucket. These permissions are granted through service account roles, such as the Storage Admin role (`roles/storage.admin`).
+
+1. MaxText reads its configuration from a YAML file. We recommend reviewing the configurable options in `configs/base.yml`, which defines a decoder-only model of ~1B parameters. Any option can be overridden from the command line. For instance, you can change `steps` or `log_period` either by modifying `configs/base.yml` or by passing `steps` and `log_period` as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created.
+
+1. **Checkpoint Conversion**: To run MaxText on Hugging Face checkpoints, you must first convert them to the MaxText/Orbax format. For detailed instructions, see the [Checkpoint Conversion Guide](guides/checkpointing_solutions/convert_checkpoint.md).
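+
+The command-line overrides described in the second prerequisite might look like the following minimal sketch. It reuses the synthetic-data training invocation from the earlier first-run tutorial. `YOUR_JOB_NAME`, `YOUR_BUCKET`, and the `maxtext-output` folder are placeholders, and the `log_period` value is an arbitrary illustration, not a recommended setting.
+
+```sh
+# Placeholders: export YOUR_JOB_NAME and YOUR_BUCKET before running.
+# ${VAR?} makes the shell stop with an error if VAR is unset.
+python3 -m maxtext.trainers.pre_train.train \
+  run_name=${YOUR_JOB_NAME?} \
+  base_output_directory=gs://${YOUR_BUCKET?}/maxtext-output \
+  dataset_type=synthetic \
+  steps=10 \
+  log_period=5
+```
+
+Each `key=value` argument here is meant to replace the corresponding value from `configs/base.yml` for that run only, so the file itself stays unchanged.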
+
+## Running MaxText on a Single Host
+
+This procedure describes how to run MaxText on a single GPU or TPU host.
+
+### 1. Installation
+
+Before running MaxText, you must install it on your VM.
+
+- For detailed installation instructions, see the [Installation Guide](install_maxtext.md).
+- For TPU VMs, install `maxtext[tpu]` for pre-training, or `maxtext[tpu-post-train]` for post-training.
+- For GPU VMs, ensure you install `maxtext[cuda12]`.
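+
+As a minimal sketch, assuming a plain `pip` workflow inside a fresh virtual environment, the extras above would be selected as shown here. The [Installation Guide](install_maxtext.md) is authoritative and may call for additional steps or a different installer.
+
+```sh
+# Run only the one line that matches your hardware and workload.
+# The quotes keep the shell from interpreting the square brackets.
+pip install "maxtext[tpu]"             # TPU VM, pre-training
+pip install "maxtext[tpu-post-train]"  # TPU VM, post-training
+pip install "maxtext[cuda12]"          # GPU VM
+```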
+
+### 2. Running Pre-training
+
+To get started with training your first model, refer to the [Pre-training Tutorial](tutorials/pretraining.md).
+
+### 3. Running Post-training
+
+To fine-tune your model or apply post-training techniques (such as SFT or RL), refer to the [Post-training Tutorial](tutorials/post_training_index.md). This guide covers various post-training workflows.
+
+### 4. Running Inference
+
+To run inference (decoding) using MaxText models, refer to the [Inference Tutorial](tutorials/inference.md). This guide covers offline and online inference, as well as integration with vLLM.
+
+## Running MaxText on Multiple Hosts
+
+Google Kubernetes Engine (GKE) is the recommended way to run MaxText on multiple hosts. It provides a managed environment for deploying and scaling containerized applications, including those that require TPUs or GPUs. See [Running MaxText with XPK](run_maxtext/run_maxtext_via_xpk.md) for details.
+
+## Running MaxText in Notebooks
+
+You can run MaxText interactively using Jupyter notebooks, Google Colab, or Visual Studio Code. Refer to the [Notebook Guide](guides/run_python_notebook.md) for instructions on setting up your notebook environment on TPUs.
+
+## Next steps: preflight optimizations
+
+After you get workloads running, there are optimizations you can apply to improve performance. For more information, see [Optimization Tips](guides/optimization.md).
diff --git a/docs/index.md b/docs/index.md
index e1058748a2..1c46eaff94 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -42,6 +42,7 @@ end-before:
maxdepth: 2
hidden:
---
+getting_started
install_maxtext
tutorials
run_maxtext
diff --git a/docs/tutorials.md b/docs/tutorials.md
index 3839829003..666c38e9d8 100644
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -22,13 +22,6 @@ Explore our tutorials to learn how to use MaxText, from your first run to advanc
---
gutter: 2
---
-```{grid-item-card} 🚀 Getting Started
-:link: tutorials/first_run
-:link-type: doc
-
-Installation, prerequisites, verification, and your first training run.
-```
-
```{grid-item-card} 🛠️ Build and upload MaxText Docker Images
---
link: tutorials/build_maxtext
@@ -64,7 +57,6 @@ Step-by-step guides for running inference of MaxText models on vLLM.
hidden:
maxdepth: 1
---
-tutorials/first_run.md
tutorials/build_maxtext.md
tutorials/pretraining.md
tutorials/post_training_index.md
diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md
deleted file mode 100644
index ab98ece2f1..0000000000
--- a/docs/tutorials/first_run.md
+++ /dev/null
@@ -1,105 +0,0 @@
-
-
-(first-run)=
-
-# Getting started: First run
-
-This topic provides a basic introduction to get your MaxText workload up and running on single host and multihost environments using Cloud TPUs or NVIDIA GPUs. To help you get familiar with MaxText, we recommend starting with a single host first and then moving to multihost.
-
-## Prerequisites: Set up storage and configure MaxText
-
-1. To store logs and checkpoints, [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) in your project. To run MaxText, the TPU or GPU VMs must have read/write permissions for the bucket. These permissions are granted by service account roles, such as the `STORAGE ADMIN` role.
-
-2. MaxText reads a yaml file for configuration. We also recommend reviewing the configurable options in `configs/base.yml`. This file includes a decoder-only model of ~1B parameters. The configurable options can be overwritten from the command line. For instance, you can change the `steps` or `log_period` by either modifying `configs/base.yml` or by passing in `steps` and `log_period` as additional arguments to the `train.py` call. Set `base_output_directory` to a folder in the bucket you just created.
-
-## Local development for single host
-
-This procedure describes how to run MaxText on a single GPU or TPU host.
-
-### Run MaxText on cloud TPUs
-
-Local development is a convenient way to run MaxText on a single host. It doesn't scale to
-multiple hosts but is a good way to learn about MaxText.
-
-1. [Create and SSH to the single host VM of your choice](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm). You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`.
-2. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on TPUs, install `maxtext[tpu]`.
-3. After installation completes, run training on synthetic data with the following command:
-
-```sh
-python3 -m maxtext.trainers.pre_train.train \
- run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
- dataset_type=synthetic \
- steps=10
-```
-
-Optional: If you want to try training on a Hugging Face dataset, see [Data Input Pipeline](../guides/data_input_pipeline.md) for data input options.
-
-4. To demonstrate model output, run the following command:
-
-```sh
-python3 -m maxtext.inference.decode \
- run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
- per_device_batch_size=1
-```
-
-This command uses a model with randomly initialized weights, so the outputs are also random. To get high quality output you need pass in a checkpoint, typically via the `load_parameters_path` argument.
-
-### Run MaxText via notebook
-
-In the same TPU VM where you just installed all the dependencies of MaxText, You can also run training and decoding in MaxText via Notebook (for e.g., via Jupyter or Colab).
-
-#### Decoding in MaxText via notebook
-
-You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/examples/demo_decoding.ipynb) to try out decoding on MaxText's `Llama3.1-8b` model implementation. In this notebook, we give `"I love to"` as the prompt, and the greedily sampled first output token is `" cook"`. Please remember to provide the path to your `Llama3.1-8b` checkpoint for the `load_parameters_path` argument in the config inside the notebook. You can use [to_maxtext.py](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/checkpoint_conversion/to_maxtext.py) to create a MaxText/Orbax checkpoint from a Huggingface checkpoint.
-
-### Run MaxText on NVIDIA GPUs
-
-1. For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md). For this tutorial on GPUs, install `maxtext[cuda12]`.
-2. After installation is complete, run training with the following command on synthetic data:
-
-```sh
-python3 -m maxtext.trainers.pre_train.train \
- run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
- dataset_type=synthetic \
- steps=10
-```
-
-3. To demonstrate model output, run the following command:
-
-```sh
-python3 -m maxtext.inference.decode \
- run_name=${YOUR_JOB_NAME?} \
- base_output_directory=gs:// \
- per_device_batch_size=1
-```
-
-If you see the following error when running inside a container, set a larger `--shm-size` (for example, `--shm-size=1g`):
-
-```
-Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:297: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details); current tracing scope: all-reduce-start.2; current profiling annotation: XlaModule:#hlo_module=jit__unnamed_wrapped_function_,program_id=7#.
-```
-
-## Multihost development
-
-Google Kubernetes Engine (GKE) is the recommended way to run MaxText on multiple hosts. It provides a managed environment for deploying and scaling containerized applications, including those that require TPUs or GPUs. See [Running Maxtext with XPK](../run_maxtext/run_maxtext_via_xpk.md) for details.
-
-## Next steps: preflight optimizations
-
-After you get workloads running, there are optimizations you can apply to improve performance. For more information, see [PREFLIGHT.md](https://github.com/google/maxtext/blob/main/PREFLIGHT.md).