diff --git a/docs/development.md b/docs/development.md index 19705ebee0..7237229814 100644 --- a/docs/development.md +++ b/docs/development.md @@ -9,13 +9,13 @@ The MaxText documentation website is built using [Sphinx](https://www.sphinx-doc If you are writing documentation for MaxText, you may want to preview the documentation site locally to ensure things work as expected before a deployment to Read The Docs. -First, make sure you install the necessary dependencies. You can do this by navigating to your local clone of the MaxText repo and running: +First, make sure you install the necessary dependencies. You can do this by navigating to your local clone of the MaxText repo, following the [local installation instructions](install_maxtext.md), and running: ```bash -pip install -r src/dependencies/requirements/requirements_docs.txt +uv pip install -r src/dependencies/requirements/requirements_docs.txt ``` -Once the dependencies are installed, navigate to the `docs/` directory and run the `sphinx-build`: +Once the dependencies are installed and your `maxtext_venv` virtual environment is activated, you can navigate to the `docs/` folder and run: ```bash cd docs diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md index 497f968125..5a7d66981d 100644 --- a/docs/guides/data_input_pipeline/data_input_grain.md +++ b/docs/guides/data_input_pipeline/data_input_grain.md @@ -109,7 +109,7 @@ Note that `FILE_PATH` is optional; when provided, the script runs `ls -R` for pr 4. 
Example command: ```sh -bash tools/setup/setup_gcsfuse.sh \ +bash src/dependencies/scripts/setup_gcsfuse.sh \ DATASET_GCS_BUCKET=maxtext-dataset \ MOUNT_PATH=/tmp/gcsfuse && \ python3 -m maxtext.trainers.pre_train.train \ diff --git a/docs/guides/optimization/benchmark_and_performance.md b/docs/guides/optimization/benchmark_and_performance.md index 858bcb9673..2f50feb644 100644 --- a/docs/guides/optimization/benchmark_and_performance.md +++ b/docs/guides/optimization/benchmark_and_performance.md @@ -69,7 +69,7 @@ Different quantization recipes are available, including` "int8", "fp8", "fp8_ful For v6e and earlier generation TPUs, use the "int8" recipe. For v7x and later generation TPUs, use "fp8_full". GPUs should use “fp8_gpu” for NVIDIA and "nanoo_fp8" for AMD. -See [](quantization). +See [](quantization-doc). ### Choose sharding strategy @@ -98,16 +98,16 @@ There are two methods for asynchronous collective offloading: 1. Offload Collectives to Sparse Core: - This method is recommended for v7x. To enable it, set the following flags from \[[link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L70)\]: + This method is recommended for v7x. To enable it, set the following flags ([link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L70)): - `ENABLE_SPARSECORE_OFFLOADING_FOR_RS_AG_AR` - `ENABLE_SPARSECORE_OFFLOADING_FOR_REDUCE_SCATTER` - `ENABLE_SPARSECORE_OFFLOADING_FOR_ALL_GATHER` - `ENABLE_SPARSECORE_OFFLOADING_FOR_ALL_REDUCE` -2. Overlap Collective Using Continuation Fusion:\*\* +2. Overlap Collective Using Continuation Fusion: - This method is recommended for v5p and v6e. To enable it, set the following flags \[[link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L39)\]: + This method is recommended for v5p and v6e. 
To enable it, set the following flags ([link](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/xla_flags_library.py#L39)): - `CF_FOR_ALL_GATHER` - `CF_FOR_ALL_REDUCE` diff --git a/docs/guides/optimization/custom_model.md b/docs/guides/optimization/custom_model.md index 3ba6a1df59..991c322a99 100644 --- a/docs/guides/optimization/custom_model.md +++ b/docs/guides/optimization/custom_model.md @@ -85,7 +85,7 @@ Use these general runtime configurations to improve your model's performance. ## Step 3. Choose efficient sharding strategies using Roofline Analysis -To achieve good performance, it's often necessary to co-design the model's dimensions (like the MLP dimension) along with the sharding strategy. We have included examples for [v5p](https://docs.cloud.google.com/tpu/docs/v5p), [Trillium](https://docs.cloud.google.com/tpu/docs/v6e), and [Ironwood](https://docs.cloud.google.com/tpu/docs/tpu7x) that demonstrate which sharding approaches work well for specific models. We recommend reading [](sharding) and Jax’s [scaling book](https://jax-ml.github.io/scaling-book/sharding/). +To achieve good performance, it's often necessary to co-design the model's dimensions (like the MLP dimension) along with the sharding strategy. We have included examples for [v5p](https://docs.cloud.google.com/tpu/docs/v5p), [Trillium](https://docs.cloud.google.com/tpu/docs/v6e), and [Ironwood](https://docs.cloud.google.com/tpu/docs/tpu7x) that demonstrate which sharding approaches work well for specific models. We recommend reading [](sharding_on_TPUs) and JAX’s [scaling book](https://jax-ml.github.io/scaling-book/sharding/). 
| TPU Type | ICI Arithmetic Intensity | | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | diff --git a/docs/reference.md b/docs/reference.md index 3c8d8acc6e..fe8d74faa6 100644 --- a/docs/reference.md +++ b/docs/reference.md @@ -18,37 +18,42 @@ Deep dive into MaxText architecture, models, and core concepts. -::::{grid} 1 2 2 2 -:gutter: 2 - -:::{grid-item-card} 📊 Performance Metrics +````{grid} 1 2 2 2 +--- +gutter: 2 +--- +```{grid-item-card} 📊 Performance Metrics :link: reference/performance_metrics :link-type: doc Understanding Model Flops Utilization (MFU), calculation methods, and why it matters for performance optimization. -::: +``` -:::{grid-item-card} 🤖 Models +```{grid-item-card} 🤖 Models :link: reference/models :link-type: doc Supported models and architectures, including Llama, Qwen, and Mixtral. Details on tiering and new additions. -::: +``` -:::{grid-item-card} 🏗️ Architecture +```{grid-item-card} 🏗️ Architecture :link: reference/architecture :link-type: doc High-level overview of MaxText design, JAX/XLA choices, and how components interact. -::: +``` -:::{grid-item-card} 💡 Core Concepts +```{grid-item-card} 💡 Core Concepts :link: reference/core_concepts :link-type: doc Key concepts including checkpointing strategies, quantization, tiling, and Mixture of Experts (MoE) configuration. -::: -:::: +``` +```` + +## 📚 API Reference + +Find comprehensive API documentation for MaxText modules, classes, and functions in the [API Reference page](reference/api.rst). 
```{toctree} --- @@ -59,4 +64,5 @@ reference/performance_metrics reference/models reference/architecture reference/core_concepts +reference/api.rst ``` diff --git a/docs/reference/core_concepts/quantization.md b/docs/reference/core_concepts/quantization.md index 6f72da9ea9..dae117a85a 100644 --- a/docs/reference/core_concepts/quantization.md +++ b/docs/reference/core_concepts/quantization.md @@ -14,7 +14,7 @@ limitations under the License. --> -(quantization)= +(quantization-doc)= # Quantization
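
For reviewers who want to try the updated docs workflow locally, the `docs/development.md` changes above amount to roughly the following. This is a sketch, not part of the patch: the exact `sphinx-build` invocation and output directory are assumptions, since the hunk truncates the command after `cd docs`.

```shell
# Assumes a local MaxText clone with the maxtext_venv virtual environment
# created per the local installation instructions and already activated.
uv pip install -r src/dependencies/requirements/requirements_docs.txt

# Build the HTML site; the builder and output path are assumptions.
cd docs
sphinx-build -b html . _build/html

# Serve the built site for a local preview at http://localhost:8000
python3 -m http.server --directory _build/html 8000
```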