Fix broken links and formatting for documentation

melissawm · melissawm · commit 99dbb2fce70e · 2026-04-02T16:51:55.000-03:00
Also adds API documentation to ToC.

Fix path to setup_gcsfuse.sh
diff --git a/docs/development.md b/docs/development.md
@@ -9,13 +9,13 @@ The MaxText documentation website is built using [Sphinx](https://www.sphinx-doc
 
 If you are writing documentation for MaxText, you may want to preview the documentation site locally to ensure things work as expected before a deployment to Read The Docs.
 
-First, make sure you install the necessary dependencies. You can do this by navigating to your local clone of the MaxText repo and running:
+First, make sure you install the necessary dependencies. You can do this by navigating to your local clone of the MaxText repo, following the [local installation instructions](install_maxtext.md), and running:
 
 ```bash
-pip install -r src/dependencies/requirements/requirements_docs.txt
+uv pip install -r src/dependencies/requirements/requirements_docs.txt
 ```
 
-Once the dependencies are installed, navigate to the `docs/` directory and run the `sphinx-build`:
+Once the dependencies are installed and your `maxtext_venv` virtual environment is activated, you can navigate to the `docs/` folder and run:
 
 ```bash
 cd docs
diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md
@@ -34,10 +34,10 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 
 1. Grain currently supports three data formats: [ArrayRecord](https://github.com/google/array_record) (random access), [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups) and [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord)(sequential access). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources/protocol.html) class.
    - **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet.
-2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve the perf for the ArrayRecord format as it allows meta data caching to speeds up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.
+2. If the dataset is hosted on a Cloud Storage bucket, the path `gs://` can be provided directly. However, for the best performance, it's recommended to read the bucket through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). This will significantly improve performance for the ArrayRecord format, as it allows metadata caching to speed up random access. The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh). The script configures some parameters for the mount.
 
 ```sh
-bash tools/setup/setup_gcsfuse.sh \
+bash src/dependencies/scripts/setup_gcsfuse.sh \
 DATASET_GCS_BUCKET=${BUCKET_NAME?} \
 MOUNT_PATH=${MOUNT_PATH?} \
 [FILE_PATH=${MOUNT_PATH?}/my_dataset]
@@ -47,7 +47,7 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr
 
 1. Set `dataset_type=grain`, `grain_file_type={arrayrecord|parquet|tfrecord}`, `grain_train_files` in `src/maxtext/configs/base.yml` or through command line arguments to match the file pattern on the mounted local path.
 
-2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh) to avoid gcsfuse throttling.
+2. Tune `grain_worker_count` for performance. This parameter controls the number of child processes used by Grain (more details in [behind_the_scenes](https://google-grain.readthedocs.io/en/latest/behind_the_scenes.html)). If you use a large number of workers, check your config for gcsfuse in [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh) to avoid gcsfuse throttling.
 
 3. ArrayRecord Only: For multi-source blending, you can specify multiple data sources with their respective weights using semicolon (;) as a separator and a comma (,) for weights. The weights will be automatically normalized to sum to 1.0. For example:
 
@@ -109,7 +109,7 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr
 4. Example command:
 
 ```sh
-bash tools/setup/setup_gcsfuse.sh \
+bash src/dependencies/scripts/setup_gcsfuse.sh \
 DATASET_GCS_BUCKET=maxtext-dataset \
 MOUNT_PATH=/tmp/gcsfuse && \
 python3 -m maxtext.trainers.pre_train.train \
diff --git a/docs/guides/optimization/benchmark_and_performance.md b/docs/guides/optimization/benchmark_and_performance.md
@@ -69,7 +69,7 @@ Different quantization recipes are available, including` "int8", "fp8", "fp8_ful
 
 For v6e and earlier generation TPUs, use the "int8" recipe. For v7x and later generation TPUs, use "fp8_full". GPUs should use “fp8_gpu” for NVIDIA and "nanoo_fp8" for AMD.
 
-See [](quantization).
+See [](quantization-doc).
 
 ### Choose sharding strategy
 
@@ -98,16 +98,16 @@ There are two methods for asynchronous collective offloading:
 
 1. Offload Collectives to Sparse Core:
 
-   This method is recommended for v7x. To enable it, set the following flags from \[[link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L70)\]:
+   This method is recommended for v7x. To enable it, set the following flags from [link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L70):
 
 - `ENABLE_SPARSECORE_OFFLOADING_FOR_RS_AG_AR`
 - `ENABLE_SPARSECORE_OFFLOADING_FOR_REDUCE_SCATTER`
 - `ENABLE_SPARSECORE_OFFLOADING_FOR_ALL_GATHER`
 - `ENABLE_SPARSECORE_OFFLOADING_FOR_ALL_REDUCE`
 
-2. Overlap Collective Using Continuation Fusion:\*\*
+2. Overlap Collective Using Continuation Fusion:
 
-   This method is recommended for v5p and v6e. To enable it, set the following flags \[[link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L39)\]:
+   This method is recommended for v5p and v6e. To enable it, set the following flags ([link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L39)):
 
 - `CF_FOR_ALL_GATHER`
 - `CF_FOR_ALL_REDUCE`
diff --git a/docs/guides/optimization/custom_model.md b/docs/guides/optimization/custom_model.md
@@ -85,7 +85,7 @@ Use these general runtime configurations to improve your model's performance.
 
 ## Step 3. Choose efficient sharding strategies using Roofline Analysis
 
-To achieve good performance, it's often necessary to co-design the model's dimensions (like the MLP dimension) along with the sharding strategy. We have included examples for [v5p](https://docs.cloud.google.com/tpu/docs/v5p), [Trillium](https://docs.cloud.google.com/tpu/docs/v6e), and [Ironwood](https://docs.cloud.google.com/tpu/docs/tpu7x) that demonstrate which sharding approaches work well for specific models. We recommend reading [](sharding) and Jax’s [scaling book](https://jax-ml.github.io/scaling-book/sharding/).
+To achieve good performance, it's often necessary to co-design the model's dimensions (like the MLP dimension) along with the sharding strategy. We have included examples for [v5p](https://docs.cloud.google.com/tpu/docs/v5p), [Trillium](https://docs.cloud.google.com/tpu/docs/v6e), and [Ironwood](https://docs.cloud.google.com/tpu/docs/tpu7x) that demonstrate which sharding approaches work well for specific models. We recommend reading [](sharding_on_TPUs) and JAX’s [scaling book](https://jax-ml.github.io/scaling-book/sharding/).
 
 | TPU Type | ICI Arithmetic Intensity                                                                                                                           |
 | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
diff --git a/docs/reference.md b/docs/reference.md
@@ -18,37 +18,42 @@
 
 Deep dive into MaxText architecture, models, and core concepts.
 
-::::\{grid} 1 2 2 2
-:gutter: 2
-
-:::\{grid-item-card} 📊 Performance Metrics
+````{grid} 1 2 2 2
+---
+gutter: 2
+---
+```{grid-item-card} 📊 Performance Metrics
 :link: reference/performance_metrics
 :link-type: doc
 
 Understanding Model Flops Utilization (MFU), calculation methods, and why it matters for performance optimization.
-:::
+```
 
-:::\{grid-item-card} 🤖 Models
+```{grid-item-card} 🤖 Models
 :link: reference/models
 :link-type: doc
 
 Supported models and architectures, including Llama, Qwen, and Mixtral. Details on tiering and new additions.
-:::
+```
 
-:::\{grid-item-card} 🏗️ Architecture
+```{grid-item-card} 🏗️ Architecture
 :link: reference/architecture
 :link-type: doc
 
 High-level overview of MaxText design, JAX/XLA choices, and how components interact.
-:::
+```
 
-:::\{grid-item-card} 💡 Core Concepts
+```{grid-item-card} 💡 Core Concepts
 :link: reference/core_concepts
 :link-type: doc
 
 Key concepts including checkpointing strategies, quantization, tiling, and Mixture of Experts (MoE) configuration.
-:::
-::::
+```
+````
+
+## 📚 API Reference
+
+Find comprehensive API documentation for MaxText modules, classes, and functions in the [API Reference page](reference/api.rst).
 
 ```{toctree}
 ---
@@ -59,4 +64,5 @@ reference/performance_metrics
 reference/models
 reference/architecture
 reference/core_concepts
+reference/api.rst
 ```
diff --git a/docs/reference/core_concepts/quantization.md b/docs/reference/core_concepts/quantization.md
@@ -14,7 +14,7 @@
  limitations under the License.
  -->
 
-(quantization)=
+(quantization-doc)=
 
 # Quantization
 
diff --git a/docs/tutorials/pretraining.md b/docs/tutorials/pretraining.md
@@ -87,7 +87,7 @@ eval metrics after step: 9, loss=9.420, total_weights=75264.0
 
 Grain is a library for reading data for training and evaluating JAX models. It is the recommended input pipeline for determinism and resilience! It supports data formats like ArrayRecord and Parquet. You can check [Grain pipeline](../guides/data_input_pipeline/data_input_grain.md) for more details.
 
-**Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage Fuse with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/tools/setup/setup_gcsfuse.sh).
+**Data preparation**: You need to download data to a Cloud Storage bucket, and read data via Cloud Storage FUSE with [setup_gcsfuse.sh](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/dependencies/scripts/setup_gcsfuse.sh).
 
 - For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training
   ```bash