docs & feat: clarify and improve scan_layers mismatch error handling in conversion, training and checkpoints
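
For reviewers, here is a minimal sketch of the failure mode this commit documents. It is not the code added to `checkpointing.py` or `model_creation_utils.py`. The helper names (`leaf_paths`, `check_structure`), the key names, and the error wording are hypothetical, and plain nested dicts with shape tuples stand in for real JAX PyTrees. A scanned tree stores one stacked `layers` entry, while an unscanned tree stores one entry per layer, so their key paths only partly overlap.

```python
def leaf_paths(tree, prefix=()):
    """Return the set of key paths that lead to a leaf of a nested dict."""
    if not isinstance(tree, dict):
        return {prefix}
    paths = set()
    for key, subtree in tree.items():
        paths |= leaf_paths(subtree, prefix + (key,))
    return paths


def check_structure(expected, restored, scan_layers):
    """Raise a ValueError naming scan_layers when the key paths differ."""
    want, got = leaf_paths(expected), leaf_paths(restored)
    if want != got:
        matched = len(want & got)
        raise ValueError(
            f"{matched}/{len(want)} paths matched. The checkpoint may have been "
            f"saved with scan_layers={not scan_layers}, but the model was "
            f"built with scan_layers={scan_layers}."
        )


# Illustrative two-layer model; leaves are parameter shapes only.
scanned = {  # one stacked entry with a leading layer axis
    "decoder": {
        "layers": {"mlp": {"kernel": (2, 8, 32)}},
        "norm": {"scale": (8,)},
    }
}
unscanned = {  # one entry per layer
    "decoder": {
        "layers_0": {"mlp": {"kernel": (8, 32)}},
        "layers_1": {"mlp": {"kernel": (8, 32)}},
        "norm": {"scale": (8,)},
    }
}

try:
    # The model expects the scanned layout, but the checkpoint is unscanned.
    check_structure(scanned, unscanned, scan_layers=True)
except ValueError as err:
    print(err)
```

Only the shared `norm` path lines up in this toy case, which mirrors the partial "X/Y paths matched" reports described in the updated Common Errors entry.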

shralex · shralex · commit f83391489ec3 · 2026-06-07T01:18:53.000Z
diff --git a/docs/development.md b/docs/development.md
@@ -7,4 +7,5 @@ hidden:
 ---
 development/update_dependencies.md
 development/contribute_docs.md
+development/hlo_diff_testing.md
 ```
diff --git a/docs/guides/checkpointing_solutions/convert_checkpoint.md b/docs/guides/checkpointing_solutions/convert_checkpoint.md
@@ -70,7 +70,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
 ### Key Parameters
 
 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. See the [checkpoints reference](../../reference/core_concepts/checkpoints.md) for more information. **IMPORTANT:** This setting *must* match the `scan_layers` value used when the checkpoint is loaded for training or inference. A mismatch causes PyTree loading errors, which MaxText intercepts and re-raises as a descriptive `ValueError` explaining the mismatch.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
@@ -239,7 +239,10 @@ Here is an example [PR to add support for gemma3 multi-modal model](https://gith
 
 ### Common Errors
 
-- "Type ShapeDtypeStruct is not a valid JAX type": Usually caused by a mismatch in the `scan_layers` flag.
+- "Type ShapeDtypeStruct is not a valid JAX type" or generic **PyTree structure/shape mismatches** (e.g., Orbax reporting `"X/Y paths matched"`, such as `143/145 paths`):
+  This is almost always caused by a mismatch in the `scan_layers` configuration between the checkpoint conversion script (e.g., `to_maxtext.py` or `to_huggingface.py`) and the trainer/inference runner (e.g., `train.py`).
+
+  - **Solution:** Ensure the `scan_layers` flag is set to the exact same value (`True` or `False`) in both the conversion command and your training/execution command.
 
 - If the converted checkpoint loads without errors but produces nonsensical output, likely an error in the Q/K/V weight reshaping logic during conversion.
 
diff --git a/docs/guides/data_input_pipeline.md b/docs/guides/data_input_pipeline.md
@@ -64,5 +64,6 @@ hidden:
 data_input_pipeline/data_input_grain
 data_input_pipeline/data_input_hf
 data_input_pipeline/data_input_tfds
+data_input_pipeline/olmo_grain
 data_input_pipeline/data_pipeline_perf.md
 ```
diff --git a/docs/guides/optimization.md b/docs/guides/optimization.md
@@ -18,37 +18,39 @@
 
 Explore techniques for maximizing performance, including model customization, sharding strategies, Pallas kernels, and benchmarking.
 
-::::{grid} 1 2 2 2
-:gutter: 2
+````{grid} 1 2 2 2
+---
+gutter: 2
+---
 
-:::{grid-item-card} 🛠️ Customizing Model Configs
+```{grid-item-card} 🛠️ Customizing Model Configs
 :link: optimization/custom_model
 :link-type: doc
 
 Optimize and customize your LLM model configurations for higher performance (MFU) on TPUs.
-:::
+```
 
-:::{grid-item-card} 🥞 Sharding Strategies
+```{grid-item-card} 🥞 Sharding Strategies
 :link: optimization/sharding
 :link-type: doc
 
 Choose efficient sharding strategies (FSDP, TP, EP, PP) using Roofline Analysis and understand arithmetic intensity.
-:::
+```
 
-:::{grid-item-card} ⚡ Pallas Kernels
+```{grid-item-card} ⚡ Pallas Kernels
 :link: optimization/pallas_kernels_performance
 :link-type: doc
 
 Optimize with Pallas kernels for fine-grained control.
-:::
+```
 
-:::{grid-item-card} 📈 Benchmarking & Tuning
+```{grid-item-card} 📈 Benchmarking & Tuning
 :link: optimization/benchmark_and_performance
 :link-type: doc
 
 Guide to setting up benchmarks, performing performance tuning, and analyzing metrics.
-:::
-::::
+```
+````
 
 ```{toctree}
 ---
@@ -57,6 +59,7 @@ maxdepth: 1
 ---
 optimization/custom_model.md
 optimization/sharding.md
+optimization/custom_mesh_and_rule.md
 optimization/pallas_kernels_performance.md
 optimization/benchmark_and_performance.md
 ```
diff --git a/docs/reference/core_concepts/checkpoints.md b/docs/reference/core_concepts/checkpoints.md
@@ -66,6 +66,14 @@ Their difference can also be represented in the following pytree structure:
 
 The stacked format is highly efficient but has one key requirement: all layers within the `scan` operation must have identical configurations. For models with heterogeneous layers (where layer configurations differ), stacking is not possible, and only unstacked checkpoints can be used.
 
+In MaxText, the **`scan_layers`** configuration parameter is used to control this setting:
+
+- `scan_layers=true` tells MaxText to stack layer parameters (recommended for training).
+- `scan_layers=false` tells MaxText to keep layer parameters unstacked (often required for inference and certain model architectures).
+
+> [!IMPORTANT]
+> **PyTree Structure Compatibility:** Because JAX expects the loaded PyTree structure to exactly match the model's instantiated structure, the value of the `scan_layers` flag during execution (training, SFT, RL, DPO, or decoding) **must** match the format of the checkpoint being loaded. A mismatch will cause PyTree loading or shape/path mismatch errors, which MaxText will intercept to raise a descriptive `ValueError` pointing to the `scan_layers` setting.
+
 ### Takeaways
 
 To summarize the four checkpoint types:
diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md
@@ -65,6 +65,9 @@ python3 -m maxtext.inference.decode \
 
 **Note:** Because the model hasn't been properly trained, the output text will be random. To generate meaningful output, you need to load a trained checkpoint using the `load_parameters_path` argument.
 
+> [!NOTE]
+> **Checkpoints & `scan_layers` compatibility:** When loading an external or converted checkpoint via `load_parameters_path`, the `scan_layers` setting in your command **must** match the setting used to save the checkpoint. If the checkpoint was saved/converted with `scan_layers=False` (common for Hugging Face conversions and inference runs), you must specify `scan_layers=False` in your command. Otherwise, JAX/Orbax will raise PyTree structure mismatch errors.
+
 ### Running models using provided configs
 
 MaxText provides many OSS model configs that you can use directly to run training jobs on those model-specific architectures. These model-specific YAML files are located in `src/maxtext/configs/models` for TPU-oriented defaults, and `src/maxtext/configs/models/gpu` for GPU-oriented defaults.
diff --git a/docs/tutorials/posttraining/dpo.md b/docs/tutorials/posttraining/dpo.md
@@ -103,6 +103,14 @@ Refer to the steps in [Hugging Face to MaxText](../../guides/checkpointing_solut
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+>
+> - If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> - If `scan_layers` does not match, MaxText will raise a `ValueError`.
+>   See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Running DPO Training
 
 You can run the DPO training using the specialized post-training script:
diff --git a/docs/tutorials/posttraining/rl_on_multi_host.md b/docs/tutorials/posttraining/rl_on_multi_host.md
@@ -148,6 +148,14 @@ Refer to the steps in [Hugging Face to MaxText](../../guides/checkpointing_solut
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your RL training run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+>
+> - If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> - If `scan_layers` does not match, MaxText will raise a `ValueError`.
+>   See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Submit your RL workload via Pathways
 
 See the **Troubleshooting** section for concise instructions on how to retry or
diff --git a/docs/tutorials/posttraining/sft.md b/docs/tutorials/posttraining/sft.md
@@ -88,6 +88,14 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+>
+> - If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> - If `scan_layers` does not match, MaxText will raise a `ValueError`.
+>   See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Run SFT on Hugging Face Dataset
 
 Now you are ready to run SFT using the following command:
diff --git a/docs/tutorials/posttraining/sft_on_multi_host.md b/docs/tutorials/posttraining/sft_on_multi_host.md
@@ -139,6 +139,14 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # gs://my-bucket/my-checkpoint-directory/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+>
+> - If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> - If `scan_layers` does not match, MaxText will raise a `ValueError`.
+>   See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Submit workload on GKE cluster
 
 This section provides the command to run SFT on a GKE cluster.
diff --git a/src/maxtext/common/checkpointing.py b/src/maxtext/common/checkpointing.py
diff --git a/src/maxtext/utils/model_creation_utils.py b/src/maxtext/utils/model_creation_utils.py
diff --git a/tests/integration/checkpointing_test.py b/tests/integration/checkpointing_test.py
diff --git a/tests/unit/checkpointing_nnx_load_test.py b/tests/unit/checkpointing_nnx_load_test.py