AI-Hypercomputer
diff --git a/docs/development/contribute_docs.md b/docs/development/contribute_docs.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/development/update_dependencies.md b/docs/development/update_dependencies.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/guides/model_bringup.md b/docs/guides/model_bringup.md
Lines changed: 6 additions & 6 deletions
diff --git a/docs/guides/run_python_notebook.md b/docs/guides/run_python_notebook.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/index.html b/docs/index.html
Lines changed: 1 addition & 1 deletion
diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/reference/architecture/jax_ai_libraries_chosen.md b/docs/reference/architecture/jax_ai_libraries_chosen.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/reference/models/tiering.md b/docs/reference/models/tiering.md
Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ documentation site locally to ensure things work as expected before a deployment
 to [Read The Docs](https://about.readthedocs.com/?ref=app.readthedocs.org).
 
 First, make sure you
-[install MaxText from source](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source)
+[install MaxText from source](../install_maxtext.md#from-source)
 and install the necessary dependencies. You can do this by navigating to your
 local clone of the MaxText repo and running:
 
 
@@ -139,6 +139,6 @@ mv generated_artifacts/python3_12/cuda12-requirements.txt \
 Finally, test that the new dependencies install correctly and that MaxText runs
 as expected.
 
-1. **Install MaxText and dependencies**: For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source).
+1. **Install MaxText and dependencies**: For instructions on installing MaxText on your VM, please refer to the [official documentation](../install_maxtext.md#from-source).
 
 2. **Run tests:** Run MaxText tests to ensure there are no regressions.
@@ -20,15 +20,15 @@ This documentation acts as the primary resource for efficiently integrating new
 
 ## 1. Architecture Analysis
 
-The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](https://maxtext.readthedocs.io/en/latest/reference/architecture/architecture_overview.html) and [list of supported models](https://maxtext.readthedocs.io/en/latest/reference/models/supported_models_and_architectures.html).
+The first phase involves determining how the new model's architecture aligns with MaxText's existing capabilities. To facilitate this assessment, refer to the [MaxText architecture overview](../reference/architecture/architecture_overview.md) and [list of supported models](../reference/models/supported_models_and_architectures.md).
 
-**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](https://maxtext.readthedocs.io/en/latest/guides/data_input_pipeline.html)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
+**Input Data Pipeline**: MaxText supports HuggingFace, Grain, and TFDS pipelines ([details](data_input_pipeline.md)). While synthetic data is typically used for initial performance benchmarks, the framework supports multiple modalities including text and image (audio and video - work in progress).
 
 **Tokenizer**: Supported [tokenizer options](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/input_pipeline/tokenizer.py) include `TikTokenTokenizer`, `SentencePieceTokenizer`, and `HFTokenizer`.
 
 **Self-Attention & RoPE**: Available mechanisms include optimized [Flash Attention](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/layers/attention_op.py#L1184) (supporting MHA, GQA, and MQA), Multi-head Latent Attention ([MLA](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/attention_mla.py)), and [Gated Delta Network](https://github.com/AI-Hypercomputer/maxtext/blob/62ee818144eb037ad3fe85ab8e789cd074776f46/src/maxtext/models/qwen3.py#L358). MaxText also supports [Regular](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L108), [Llama](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L178), and [YaRN](https://github.com/AI-Hypercomputer/maxtext/blob/88d2ffd34c0ace76f836c7ea9c2fe4cd2d271088/MaxText/layers/embeddings.py#L282) variations of Rotary Positional Embeddings (RoPE).
 
-**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/moe_configuration.html) for routed and shared experts.
+**Multi-Layer Perceptron (MLP)**: The framework supports both traditional dense models and Mixture of Experts (MoE) architectures, including [configurations](../reference/core_concepts/moe_configuration.md) for routed and shared experts.
 
 **Normalization**: We support different [normalization strategies](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/layers/normalizations.py), including RMSNorm and Gated RMSNorm. These can be configured before or after attention/MLP layers.
 
@@ -44,7 +44,7 @@ This step can be bypassed if the current MaxText codebase already supports all c
 
 While most open-source models are distributed in Safetensors or PyTorch formats, MaxText requires conversion to the [Orbax](https://orbax.readthedocs.io/en/latest/) format.
 
-There are [two primary formats](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/checkpoints.html) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
+There are [two primary formats](../reference/core_concepts/checkpoints.md) for Orbax checkpoints within MaxText, and while both are technically compatible with training and inference, we recommend following these performance-optimized guidelines:
 
 - **Scanned Format**: Recommended for **training** as it stacks layers for efficient processing via `jax.lax.scan`. To enable this, set `scan_layers=True`.
 - **Unscanned Format**: Recommended for **inference** to simplify loading individual layer parameters. To enable this, set `scan_layers=False`.
@@ -58,7 +58,7 @@ Success starts with a clear map. You must align the parameter names from your so
 
 ### 3.2 Write Script
 
-Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommended to use the [checkpoint conversion utility](https://maxtext.readthedocs.io/en/latest/guides/checkpointing_solutions/convert_checkpoint.html) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
+Use existing model scripts within the repository as templates to tailor the conversion logic for your specific architecture. We strongly recommended to use the [checkpoint conversion utility](checkpointing_solutions/convert_checkpoint.md) rather than [standalone scripts](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/checkpoint_conversion/standalone_scripts).
 
 ### 3.3 Verify Compatibility
 
@@ -132,7 +132,7 @@ If you run the `forward_pass_logit_checker.py` to compare reference logits with
 
 **Q: How to compile models for a target hardware without physical access?**
 
-**A:** If you need to compile your training run ahead of time, use the train_compile.py tool. This utility allows you to compile the primary train_step for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](https://maxtext.readthedocs.io/en/latest/guides/monitoring_and_debugging/features_and_diagnostics.html#ahead-of-time-compilation-aot) for more examples.
+**A:** If you need to compile your training run ahead of time, use the train_compile.py tool. This utility allows you to compile the primary train_step for specific target hardware without needing the actual devices on hand. It’s particularly useful for verifying your implementation's functionality on a local Cloud VM or a standard CPU. Please refer [here](monitoring_and_debugging/features_and_diagnostics.md#ahead-of-time-compilation-aot) for more examples.
 
 **Q: My model is too large for my development machine. What should I do?**
 
 
@@ -86,7 +86,7 @@ To install, click the `Extensions` icon on the left sidebar (or press `Ctrl+Shif
 
 ### Step 3: Install MaxText and Dependencies
 
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](../install_maxtext.md#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
 
 > **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
 
@@ -139,7 +139,7 @@ pip3 install jupyterlab
 
 ### Step 3: Install MaxText and Dependencies
 
-To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
+To execute post-training notebooks on your TPU-VM, follow the official [MaxText installation guides](../install_maxtext.md#from-source) and specifically follow `Option 3: Installing [tpu-post-train]`. This will ensure all post-training dependencies are installed inside your virtual environment.
 
 > **Note:** If you have previously installed MaxText with a different option (e.g., `maxtext[tpu]`), we strongly recommend using a fresh virtual environment for `maxtext[tpu-post-train]` to avoid potential library version conflicts.
 
@@ -200,7 +200,7 @@ jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
 
 ## Support and Resources
 
-- 📘 [MaxText Documentation](https://maxtext.readthedocs.io/)
+- 📘 [MaxText Documentation](../index.md)
 - 💻 [Google Colab](https://colab.research.google.com)
 - ⚡ [Cloud TPU Docs](https://cloud.google.com/tpu/docs)
 - 🧩 [Jupyter Lab](https://jupyterlab.readthedocs.io)
@@ -35,7 +35,7 @@ <h3>JAX AI Stack</h3>
             <li><a href="https://optax.readthedocs.io/en/latest/">Optax</a> - For gradient processing and optimization</li>
             <li><a href="https://tunix.readthedocs.io/en/latest/">Tunix</a> - A JAX Library with the latest experimental algorithms and post-training techniques</li>
             <li><a href="https://github.com/jax-ml/ml_dtypes">ml_dtypes</a> - NumPy dtype extensions for machine learning.</li>
-            <li><a href="https://maxtext.readthedocs.io/en/latest/index.html#model-library">MaxText model library</a> for JAX LLMs highly optimized for TPUs</li>
+            <li><a href="reference/models.html">MaxText model library</a> for JAX LLMs highly optimized for TPUs</li>
             <li><a href="https://blog.vllm.ai/2025/10/16/vllm-tpu.html">vLLM on TPU</a> for high performance sampling (inference) for Reinforcement Learning (RL)</li>
             <li><a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro">Pathways</a> for multi-host inference (sampling) and highly efficient weight transfer</li>
             <li>Optional data loading libraries (<a href="https://google-grain.readthedocs.io/en/latest/">Grain</a> or <a href="https://www.tensorflow.org/guide/data">tf.data</a>)</li>
 
@@ -74,7 +74,7 @@ This is the easiest way to get started with the latest stable version.
      access to the `build_maxtext_docker_image`, `upload_maxtext_docker_image`,
      and `xpk` commands. For more details on building and uploading Docker
      images, see the
-     [Build MaxText Docker Image](https://maxtext.readthedocs.io/en/latest/build_maxtext.html)
+     [Build MaxText Docker Image](build_maxtext.md)
      guide.
 
      ```bash
 
@@ -56,7 +56,7 @@ For more information on using Orbax, please refer to https://github.com/google/o
 
 1. **Deterministic by Design**: Grain allows storing data loader states, provides strong guarantees about data ordering and sharding even with preemptions, which is critical for reproducibility.
 2. **Global Shuffle**: Prevents local overfitting.
-3. **Built for Multi-Host Training**: The using random access file format streamlines [data loading in the multi-host environments](https://maxtext.readthedocs.io/en/maxtext-v0.2.1/guides/data_input_pipeline.html#multihost-dataloading-best-practice).
+3. **Built for Multi-Host Training**: The using random access file format streamlines [data loading in the multi-host environments](../../guides/data_input_pipeline.md#multihost-dataloading-best-practice).
 
 Its APIs are explicitly designed for the multi-host paradigm, simplifying the process of ensuring that each host loads a unique shard of the global batch.
 
 
@@ -40,4 +40,4 @@ For each of the TPU platforms listed below, we present a list of optimized model
 
 \[1\]: Performance results are subject to variations based on system configuration, software versions, and other factors. These benchmarks represent point-in-time measurements under specific conditions.
 
-\[2\]: Some older TFLOPS/s results are impacted by an updated calculation for causal attention ([PR #1988](https://github.com/AI-Hypercomputer/maxtext/pull/1988)), which halves the attention FLOPs. This change particularly affects configurations with large sequence lengths. For more details, please refer to the [performance metrics guide](https://maxtext.readthedocs.io/en/latest/reference/performance_metrics.html).
+\[2\]: Some older TFLOPS/s results are impacted by an updated calculation for causal attention ([PR #1988](https://github.com/AI-Hypercomputer/maxtext/pull/1988)), which halves the attention FLOPs. This change particularly affects configurations with large sequence lengths. For more details, please refer to the [performance metrics guide](../performance_metrics.md).
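
Most of the hunks above apply one mechanical rule: an absolute Read the Docs URL becomes a path relative to the linking file, with `.html` swapped for `.md` and any anchor kept. A minimal Python sketch of that rule, which could help spot-check the rewritten links, is given below. The function name `to_relative_link` is hypothetical, and the sketch assumes URLs shaped like `/en/<version>/<page>.html` and source paths given relative to `docs/`; it is not part of this change and does not cover the `docs/index.html` hunk or the bare site-root link.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative_link(url: str, source_doc: str) -> str:
    """Rewrite an absolute docs URL as a link relative to `source_doc`.

    Assumptions (not taken from the PR): the URL path looks like
    /en/<version>/<page>.html, and `source_doc` is relative to docs/.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ["", "en", "<version>", "<page...>"]
    if len(segments) < 4 or segments[1] != "en" or not segments[3]:
        raise ValueError(f"unexpected docs URL layout: {url}")

    # Drop the /en/<version>/ prefix and point at the Markdown source.
    target = "/".join(segments[3:])
    if target.endswith(".html"):
        target = target[: -len(".html")] + ".md"

    # Express the target relative to the directory of the linking file.
    start = posixpath.dirname(source_doc) or "."
    relative = posixpath.relpath(target, start=start)

    # Keep the in-page anchor, if any.
    if parts.fragment:
        relative += "#" + parts.fragment
    return relative


if __name__ == "__main__":
    print(
        to_relative_link(
            "https://maxtext.readthedocs.io/en/latest/install_maxtext.html#from-source",
            "development/contribute_docs.md",
        )
    )  # prints ../install_maxtext.md#from-source
```

Running the same function over the other rewritten links (for example, the `data_input_pipeline` link in `guides/model_bringup.md`) should reproduce the relative paths shown in the `+` lines, which gives a quick consistency check for the change.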