docs & feat: clarify and improve scan_layers mismatch error handling in conversion, training and checkpoints

shralex · shralex · commit c8325375fa27 · 2026-06-06T19:47:02.000Z
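
For reviewers, here is a minimal sketch of why a `scan_layers` mismatch tends to surface as an "X/Y paths matched" style failure. It uses plain Python dicts with shape tuples in place of real JAX arrays. The key names (`layers`, `layers_0`, `mlp`, `kernel`) and the helpers `scanned_tree`, `unscanned_tree`, `leaf_paths` and `matched_paths` are hypothetical illustrations, not MaxText or Orbax APIs.

```python
"""Illustrative only: toy parameter trees, not real MaxText checkpoints."""


def scanned_tree(num_layers, d_model=8):
  # scan_layers=true: one stacked entry with a leading layer axis (assumed layout).
  return {"decoder": {"layers": {"mlp": {"kernel": (num_layers, d_model, d_model)}}}}


def unscanned_tree(num_layers, d_model=8):
  # scan_layers=false: one subtree per layer (assumed naming: layers_0, layers_1, ...).
  return {
      "decoder": {
          f"layers_{i}": {"mlp": {"kernel": (d_model, d_model)}}
          for i in range(num_layers)
      }
  }


def leaf_paths(tree, prefix=()):
  # Collect "a/b/c" style paths for every non-dict leaf.
  paths = set()
  for key, value in tree.items():
    if isinstance(value, dict):
      paths |= leaf_paths(value, prefix + (key,))
    else:
      paths.add("/".join(prefix + (key,)))
  return paths


def matched_paths(checkpoint_tree, model_tree):
  # Returns (number of model paths found in the checkpoint, total model paths).
  ckpt, model = leaf_paths(checkpoint_tree), leaf_paths(model_tree)
  return len(ckpt & model), len(model)


checkpoint = unscanned_tree(num_layers=4)  # e.g. converted with scan_layers=false
model = scanned_tree(num_layers=4)         # e.g. executed with scan_layers=true
matched, total = matched_paths(checkpoint, model)
print(f"{matched}/{total} paths matched")  # prints "0/1 paths matched"
```

When both trees are built the same way, every path matches. A real restore also compares shapes and dtypes, which this sketch deliberately ignores.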
diff --git a/docs/guides/checkpointing_solutions/convert_checkpoint.md b/docs/guides/checkpointing_solutions/convert_checkpoint.md
@@ -70,7 +70,7 @@ You can find your converted checkpoint files under `${BASE_OUTPUT_DIRECTORY}/0/i
 ### Key Parameters
 
 - `model_name`: The specific model identifier. It must match a supported entry in the MaxText [globals.py](https://github.com/AI-Hypercomputer/maxtext/blob/16b684840db9b96b19e24e84ac49f06af7204ae3/src/maxtext/utils/globals.py#L46C1-L46C7).
-- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. Refer [here](../../reference/core_concepts/checkpoints.md) for more information.
+- `scan_layers`: Controls whether the output uses a scanned (`scan_layers=true`) or unscanned (`scan_layers=false`) checkpoint format. See the [checkpoints reference](../../reference/core_concepts/checkpoints.md) for more information. **IMPORTANT:** This setting *must* match the `scan_layers` value used when the checkpoint is later loaded for training or inference. A mismatch causes PyTree loading errors, which MaxText intercepts and re-raises as a descriptive `ValueError` explaining the mismatch.
 - `use_multimodal`: Indicates if multimodality is used, important for Gemma3.
 - `base_output_directory`: The path where the converted Orbax checkpoint will be stored; it can be Google Cloud Storage (GCS) or local.
 - `hardware=cpu`: The conversion script runs on a CPU machine.
@@ -239,7 +239,10 @@ Here is an example [PR to add support for gemma3 multi-modal model](https://gith
 
 ### Common Errors
 
-- "Type ShapeDtypeStruct is not a valid JAX type": Usually caused by a mismatch in the `scan_layers` flag.
+- "Type ShapeDtypeStruct is not a valid JAX type" or generic **PyTree structure/shape mismatches** (e.g., Orbax reporting `"X/Y paths matched"`, such as `143/145 paths`):
+  This is usually caused by a mismatch in the `scan_layers` configuration between the checkpoint conversion script (e.g., `to_maxtext.py` or `to_huggingface.py`) and the trainer/inference runner (e.g., `train.py`).
+  
+  * **Solution:** Ensure the `scan_layers` flag is set to the exact same value (`True` or `False`) in both the conversion command and your training/execution command.
 
 - If the converted checkpoint loads without errors but produces nonsensical output, likely an error in the Q/K/V weight reshaping logic during conversion.
 
diff --git a/docs/reference/core_concepts/checkpoints.md b/docs/reference/core_concepts/checkpoints.md
@@ -66,6 +66,13 @@ Their difference can also be represented in the following pytree structure:
 
 The stacked format is highly efficient but has one key requirement: all layers within the `scan` operation must have identical configurations. For models with heterogeneous layers (where layer configurations differ), stacking is not possible, and only unstacked checkpoints can be used.
 
+In MaxText, the **`scan_layers`** configuration parameter is used to control this setting:
+- `scan_layers=true` tells MaxText to stack layer parameters (recommended for training).
+- `scan_layers=false` tells MaxText to keep layer parameters unstacked (often required for inference and certain model architectures).
+
+> [!IMPORTANT]
+> **PyTree Structure Compatibility:** Because JAX expects the loaded PyTree structure to exactly match the model's instantiated structure, the value of the `scan_layers` flag during execution (training, SFT, RL, DPO, or decoding) **must** match the format of the checkpoint being loaded. A mismatch causes PyTree loading or shape/path mismatch errors, which MaxText intercepts and re-raises as a descriptive `ValueError` pointing to the `scan_layers` setting.
+
 ### Takeaways
 
 To summarize the four checkpoint types:
diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md
@@ -65,6 +65,9 @@ python3 -m maxtext.inference.decode \
 
 **Note:** Because the model hasn't been properly trained, the output text will be random. To generate meaningful output, you need to load a trained checkpoint using the `load_parameters_path` argument.
 
+> [!NOTE]
+> **Checkpoints & `scan_layers` compatibility:** When loading an external or converted checkpoint via `load_parameters_path`, the `scan_layers` setting in your command **must** match the setting used to save the checkpoint. If the checkpoint was saved or converted with `scan_layers=False` (common for Hugging Face conversions and inference runs), you must specify `scan_layers=False` in your command. Otherwise, loading fails with a PyTree structure mismatch, which MaxText reports as a `ValueError` pointing to the `scan_layers` setting.
+
 ### Running models using provided configs
 
 MaxText provides many OSS model configs that you can use directly to run training jobs on those model-specific architectures. These model-specific YAML files are located in `src/maxtext/configs/models` for TPU-oriented defaults, and `src/maxtext/configs/models/gpu` for GPU-oriented defaults.
diff --git a/docs/tutorials/posttraining/dpo.md b/docs/tutorials/posttraining/dpo.md
@@ -103,6 +103,13 @@ Refer to the steps in [Hugging Face to MaxText](../../guides/checkpointing_solut
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+> * If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> * If `scan_layers` does not match, MaxText will raise a `ValueError`.
+> See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Running DPO Training
 
 You can run the DPO training using the specialized post-training script:
diff --git a/docs/tutorials/posttraining/rl_on_multi_host.md b/docs/tutorials/posttraining/rl_on_multi_host.md
@@ -148,6 +148,13 @@ Refer to the steps in [Hugging Face to MaxText](../../guides/checkpointing_solut
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your RL training run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+> * If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> * If `scan_layers` does not match, MaxText will raise a `ValueError`.
+> See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Submit your RL workload via Pathways
 
 See the **Troubleshooting** section for concise instructions on how to retry or
diff --git a/docs/tutorials/posttraining/sft.md b/docs/tutorials/posttraining/sft.md
@@ -88,6 +88,13 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # e.g., gs://my-bucket/my-model-checkpoint/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+> * If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> * If `scan_layers` does not match, MaxText will raise a `ValueError`.
+> See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Run SFT on Hugging Face Dataset
 
 Now you are ready to run SFT using the following command:
diff --git a/docs/tutorials/posttraining/sft_on_multi_host.md b/docs/tutorials/posttraining/sft_on_multi_host.md
@@ -139,6 +139,13 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution
 export MAXTEXT_CKPT_PATH=<CKPT_PATH> # gs://my-bucket/my-checkpoint-directory/0/items
 ```
 
+> [!IMPORTANT]
+> **Matching the `scan_layers` Parameter:**
+> The `scan_layers` setting during your fine-tuning run **must match** the setting used when creating the checkpoint at `MAXTEXT_CKPT_PATH`.
+> * If the checkpoint was converted or saved with `scan_layers=False` (which is common for Hugging Face conversions and inference-ready models), you **must also provide `scan_layers=False` in the MaxText command.**
+> * If `scan_layers` does not match, MaxText will raise a `ValueError`.
+> See the [Checkpoints concept guide](../../reference/core_concepts/checkpoints.md) for more details.
+
 ## Submit workload on GKE cluster
 
 This section provides the command to run SFT on a GKE cluster.
diff --git a/src/maxtext/common/checkpointing.py b/src/maxtext/common/checkpointing.py
@@ -18,6 +18,7 @@
 from typing import Any, Optional
 
 from absl import flags
+import contextlib
 import datetime
 from etils import epath
 from flax import nnx
@@ -639,12 +640,14 @@ def _restore_grain_iterator(
   if isinstance(data_iterator, RemoteIteratorWrapper):
     grain_restore_args = GrainCheckpointRestore(item=data_iterator)
     restored_state = checkpoint_manager.restore(step, args=Composite(items=checkpoint_args, iter=grain_restore_args))
+    _assert_no_shaped_dtype_struct(restored_state)
     return (restored_state, None)
 
   # ElasticIterator: one shared `process_0.json` regardless of shard count.
   if not isinstance(data_iterator, list) and isinstance(data_iterator.local_iterator, ElasticIterator):
     grain_restore_args = GrainCheckpointRestore(item=data_iterator.local_iterator)
     restored_state = checkpoint_manager.restore(step, args=Composite(items=checkpoint_args, iter=grain_restore_args))
+    _assert_no_shaped_dtype_struct(restored_state)
     return (restored_state, None)
 
   directory = checkpoint_manager.directory / str(step) / "iter"
@@ -693,9 +696,68 @@ def _restore_grain_iterator(
 
   # Call restore once with the composed arguments
   restored_state = checkpoint_manager.restore(step, args=Composite(items=checkpoint_args, iter=grain_restore_args))
+  _assert_no_shaped_dtype_struct(restored_state)
   return (restored_state, None)
 
 
+def _is_structural_or_shape_mismatch(e: Exception) -> bool:
+  """Heuristic keyword check for whether an exception is likely a PyTree structure or shape mismatch."""
+  if not isinstance(e, (ValueError, TypeError)):
+    return False
+  msg = str(e).lower()
+  mismatch_keywords = [
+      "mismatch",
+      "structure",
+      "shape",
+      "tree",
+      "leaf",
+      "leaves",
+      "paths matched",
+      "shapedtypestruct",
+      "invalid type",
+  ]
+  return any(kw in msg for kw in mismatch_keywords)
+
+
+def _assert_no_shaped_dtype_struct(pytree):
+  """Raises a ValueError if the restored pytree still contains jax.ShapeDtypeStruct leaves."""
+  if isinstance(pytree, jax.ShapeDtypeStruct):
+    raise ValueError(
+        f"Some parameters in the restored state remained as ShapeDtypeStruct: {pytree}. "
+        "This indicates a structural mismatch between the checkpoint and the model configuration. "
+        "Usually this is due to 'scan_layers' configuration mismatch."
+    )
+
+  if hasattr(pytree, "keys") and hasattr(pytree, "__getitem__"):
+    for k in pytree.keys():
+      _assert_no_shaped_dtype_struct(pytree[k])
+  elif isinstance(pytree, (list, tuple)):
+    for v in pytree:
+      _assert_no_shaped_dtype_struct(v)
+  else:
+    leaves = jax.tree_util.tree_leaves(pytree)
+    if len(leaves) == 1 and leaves[0] is pytree:
+      return
+    for leaf in leaves:
+      _assert_no_shaped_dtype_struct(leaf)
+
+
+@contextlib.contextmanager
+def _handle_checkpoint_mismatch(context_name: str, path: str):
+  """Context manager to intercept PyTree/shape mismatches and raise descriptive errors."""
+  try:
+    yield
+  except Exception as e:
+    if _is_structural_or_shape_mismatch(e):
+      raise ValueError(
+          f"Failed to {context_name} from {path}. "
+          "This is often caused by a mismatch in the 'scan_layers' configuration "
+          "(stacked vs unstacked) between your current execution command and "
+          f"the saved checkpoint. Original error: {e}"
+      ) from e
+    raise
+
+
 def load_state_if_possible(
     checkpoint_manager: CheckpointManager | None,
     data_iterator: MultiHostDataLoadIterator | list[MultiHostDataLoadIterator] | None,
@@ -777,13 +839,15 @@ def map_to_pspec(data):
           (EmergencyCheckpointManager, EmergencyReplicatorCheckpointManager),
       ):
         checkpoint_path = str(checkpoint_manager.directory / str(step) / "items")
-        restored_nnx = _load_linen_checkpoint_into_nnx(
-            checkpoint_path,
-            abstract_unboxed_pre_state,
-            checkpoint_storage_concurrent_gb,
-            use_ocdbt,
-            use_zarr3,
-        )
+        with _handle_checkpoint_mismatch("restore NNX checkpoint", checkpoint_path):
+          restored_nnx = _load_linen_checkpoint_into_nnx(
+              checkpoint_path,
+              abstract_unboxed_pre_state,
+              checkpoint_storage_concurrent_gb,
+              use_ocdbt,
+              use_zarr3,
+          )
+          _assert_no_shaped_dtype_struct(restored_nnx)
         return ({"items": restored_nnx}, None)
 
       # Convert nnx.State to pure dict to match how checkpoints are saved for NNX
@@ -798,64 +862,74 @@ def map_to_pspec(data):
           partial_restore=True,
       )
 
-      match (checkpoint_manager, dataset_type, data_iterator):
-        # Case 1: Matches if 'checkpoint_manager' is an instance of either EmergencyCheckpointManager
-        # or EmergencyReplicatorCheckpointManager. The '_' indicates that 'dataset_type' and
-        # 'data_iterator' can be any value and aren't used in this pattern.
-        case (checkpoint_manager, _, _) if isinstance(
-            checkpoint_manager,
-            (EmergencyCheckpointManager, EmergencyReplicatorCheckpointManager),
-        ):
-          return (
-              checkpoint_manager.restore(step, args=Composite(state=checkpoint_args)).state,
-              None,
-          )
-        # Case 2: Matches if dataset type is "grain" and the data iterator is not a
-        # PlaceHolderDataIterator and a specific checkpoint file exists for the iterator
-        case (
-            checkpoint_manager,
-            dataset_type,
-            data_iterator,
-        ) if (
-            dataset_type == "grain"
-            and data_iterator
-            and not isinstance(data_iterator, PlaceHolderDataIterator)
-            and (checkpoint_manager.directory / str(step) / "iter").exists()
-        ):
-          return _restore_grain_iterator(
-              checkpoint_manager, step, data_iterator, checkpoint_args, expansion_factor_real_data
-          )
-        # Case 3: Default/Fallback case.
-        # This case acts as a wildcard ('_') and matches if none of the preceding cases were met.
-        case _:
-          return (checkpoint_manager.restore(step, args=Composite(items=checkpoint_args)), None)
+      checkpoint_path = str(checkpoint_manager.directory / str(step))
+      with _handle_checkpoint_mismatch("restore checkpoint", checkpoint_path):
+        match (checkpoint_manager, dataset_type, data_iterator):
+          # Case 1: Matches if 'checkpoint_manager' is an instance of either EmergencyCheckpointManager
+          # or EmergencyReplicatorCheckpointManager. The '_' indicates that 'dataset_type' and
+          # 'data_iterator' can be any value and aren't used in this pattern.
+          case (checkpoint_manager, _, _) if isinstance(
+              checkpoint_manager,
+              (EmergencyCheckpointManager, EmergencyReplicatorCheckpointManager),
+          ):
+            restored = checkpoint_manager.restore(step, args=Composite(state=checkpoint_args)).state
+            _assert_no_shaped_dtype_struct(restored)
+            return (
+                restored,
+                None,
+            )
+          # Case 2: Matches if dataset type is "grain" and the data iterator is not a
+          # PlaceHolderDataIterator and a specific checkpoint file exists for the iterator
+          case (
+              checkpoint_manager,
+              dataset_type,
+              data_iterator,
+          ) if (
+              dataset_type == "grain"
+              and data_iterator
+              and not isinstance(data_iterator, PlaceHolderDataIterator)
+              and (checkpoint_manager.directory / str(step) / "iter").exists()
+          ):
+            return _restore_grain_iterator(
+                checkpoint_manager, step, data_iterator, checkpoint_args, expansion_factor_real_data
+            )
+          # Case 3: Default/Fallback case.
+          # This case acts as a wildcard ('_') and matches if none of the preceding cases were met.
+          case _:
+            restored = checkpoint_manager.restore(step, args=Composite(items=checkpoint_args))
+            _assert_no_shaped_dtype_struct(restored)
+            return (restored, None)
 
   if load_parameters_from_path != "":
     if isinstance(abstract_unboxed_pre_state, nnx.State):
       _, params, _ = nnx.split(abstract_unboxed_pre_state.model, nnx.Param, ...)
     else:
       params = abstract_unboxed_pre_state.params
 
-    restored_params = load_params_from_path(
-        load_parameters_from_path,
-        params,
-        checkpoint_storage_concurrent_gb,
-        use_ocdbt=use_ocdbt,
-        use_zarr3=use_zarr3,
-    )
+    with _handle_checkpoint_mismatch("load parameters", load_parameters_from_path):
+      restored_params = load_params_from_path(
+          load_parameters_from_path,
+          params,
+          checkpoint_storage_concurrent_gb,
+          use_ocdbt=use_ocdbt,
+          use_zarr3=use_zarr3,
+      )
+      _assert_no_shaped_dtype_struct(restored_params)
     return None, restored_params
   elif load_full_state_from_path != "":
     max_logging.log(f"Loading full state from path: {load_full_state_from_path}")
-    restored_state = _load_full_state_from_path(
-        path=load_full_state_from_path,
-        abstract_unboxed_pre_state=abstract_unboxed_pre_state,
-        enable_orbax_v1=enable_orbax_v1,
-        checkpoint_conversion_fn=checkpoint_conversion_fn,
-        source_checkpoint_layout=source_checkpoint_layout,
-        checkpoint_storage_concurrent_gb=checkpoint_storage_concurrent_gb,
-        use_ocdbt=use_ocdbt,
-        use_zarr3=use_zarr3,
-    )
+    with _handle_checkpoint_mismatch("load full state", load_full_state_from_path):
+      restored_state = _load_full_state_from_path(
+          path=load_full_state_from_path,
+          abstract_unboxed_pre_state=abstract_unboxed_pre_state,
+          enable_orbax_v1=enable_orbax_v1,
+          checkpoint_conversion_fn=checkpoint_conversion_fn,
+          source_checkpoint_layout=source_checkpoint_layout,
+          checkpoint_storage_concurrent_gb=checkpoint_storage_concurrent_gb,
+          use_ocdbt=use_ocdbt,
+          use_zarr3=use_zarr3,
+      )
+      _assert_no_shaped_dtype_struct(restored_state)
     return {"items": restored_state}, None
   else:
     max_logging.log("No existing checkpoints found, not restoring checkpoint.")
diff --git a/src/maxtext/utils/model_creation_utils.py b/src/maxtext/utils/model_creation_utils.py
@@ -1060,6 +1060,14 @@ def _walk_align(ckpt, model_arr, axes):
           )
 
       except Exception as e:
+        from maxtext.common.checkpointing import _is_structural_or_shape_mismatch
+        if _is_structural_or_shape_mismatch(e):
+          raise ValueError(
+              f"Checkpoint loading failed from '{config.load_parameters_path}'. "
+              "This is often caused by a mismatch in the 'scan_layers' configuration "
+              "(stacked vs unstacked) between your current execution command and "
+              f"the saved checkpoint. Original error: {e}"
+          ) from e
         raise ValueError(f"Checkpoint loading failed: {e}") from e
 
     if wrap_with_tunix_adapter:
diff --git a/tests/integration/checkpointing_test.py b/tests/integration/checkpointing_test.py
diff --git a/tests/unit/checkpointing_nnx_load_test.py b/tests/unit/checkpointing_nnx_load_test.py