
Commit fc2dab7

Merge branch 'main' into vladk/sft-completion-fix2
2 parents d99f227 + b5f41ec

21 files changed: 634 additions & 454 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ See our guide on running MaxText in decoupled mode, without any GCP dependencies

 ## 🔥 Latest news 🔥

+* \[February 27, 2026\] New MaxText structure! MaxText has been restructured according to [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/1b9e38aa0a19b6018feb3aed757406126b6953a1/RESTRUCTURE.md). Please feel free to share your thoughts and feedback.
 * \[December 22, 2025\] [Muon optimizer](https://kellerjordan.github.io/posts/muon) is now supported.
 * \[December 10, 2025\] DeepSeek V3.1 is now supported. Use existing configs for [DeepSeek V3 671B](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/maxtext/configs/models/deepseek3-671b.yml) and load in V3.1 checkpoint to use model.
 * \[December 9, 2025\] [New RL and SFT Notebook tutorials](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/examples) are available.
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
<!--
Copyright 2026 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Batch Size

This document explains the different concepts of "batch size" within MaxText and how to configure them to tune performance and manage memory.

## Per-Device Batch Size

`per_device_batch_size` is the number of training examples processed by a single device in one forward and backward pass. This value impacts the memory usage on each device and is a configuration parameter in `configs/base.yml`.
## Global Batch Size

`global_batch_size_to_train_on` is the total number of training examples processed before the optimizer performs a single weight update. It is the effective batch size for training, calculated as:

`global_batch_size_to_train_on = per_device_batch_size x number_of_devices x gradient_accumulation_steps`

You can set `per_device_batch_size` and `gradient_accumulation_steps` in `configs/base.yml`.

`global_batch_size_to_load` is the total number of examples the data input pipeline loads from storage at once. It can be larger than the training batch size to optimize I/O performance, and is calculated as:

`global_batch_size_to_load = global_batch_size_to_train_on x expansion_factor_real_data`

When `expansion_factor_real_data > 1`, only a subset of hosts read data from the source (e.g., a GCS bucket). These "loading hosts" read more data than they need for their own devices and distribute the surplus to other "non-loading" hosts. This reduces the number of concurrent connections to the data source, which can significantly improve I/O throughput. When `expansion_factor_real_data` is between 0 and 1, the Grain pipeline can use a smaller chip count to read a data-input checkpoint produced by a job with a larger chip count. See [the Grain data input guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/data_input_pipeline/data_input_grain.md#using-grain) for details.
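To make the arithmetic concrete, here is a minimal sketch with hypothetical values (not MaxText defaults):

```python
# Hypothetical values for illustration only.
per_device_batch_size = 4
number_of_devices = 256
gradient_accumulation_steps = 2
expansion_factor_real_data = 2  # e.g. half the hosts load data and share it

# Examples per optimizer update:
global_batch_size_to_train_on = (
    per_device_batch_size * number_of_devices * gradient_accumulation_steps
)  # 4 * 256 * 2 = 2048

# Examples fetched by the input pipeline per step:
global_batch_size_to_load = (
    global_batch_size_to_train_on * expansion_factor_real_data
)  # 2048 * 2 = 4096
```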
## Gradient Accumulation Steps

`gradient_accumulation_steps` defines how many forward/backward passes are performed before the optimizer updates the model weights. The gradients from each pass are accumulated (summed). It is discussed in more detail [here](https://maxtext.readthedocs.io/en/latest/reference/core_concepts/tiling.html#gradient-accumulation).

For example, if `gradient_accumulation_steps` is set to `4`, the model will execute four forward and backward passes, sum the gradients, and then apply a single optimizer step. This achieves the same effective global batch size as quadrupling the `per_device_batch_size` with significantly less memory, but can potentially lead to lower MFU.
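As an illustration, the same behavior can be sketched with optax's `MultiSteps` wrapper; the toy model and data below are placeholders, not MaxText internals:

```python
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, batch):
  pred = batch["x"] @ params["w"]
  return jnp.mean((pred - batch["y"]) ** 2)

# Apply the optimizer once every 4 gradient computations.
tx = optax.MultiSteps(optax.adamw(1e-3), every_k_schedule=4)
params = {"w": jnp.zeros((8, 1))}
opt_state = tx.init(params)

key = jax.random.PRNGKey(0)
batches = [{"x": jax.random.normal(key, (4, 8)), "y": jnp.ones((4, 1))}
           for _ in range(8)]

for batch in batches:
  grads = jax.grad(loss_fn)(params, batch)
  # Updates are zero on 3 of every 4 calls; the 4th applies the accumulated gradients.
  updates, opt_state = tx.update(grads, opt_state, params)
  params = optax.apply_updates(params, updates)
```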
## Pipeline Microbatches

When pipeline parallelism is enabled, the global batch is split into smaller chunks called **microbatches**. These are fed into the pipeline sequentially, allowing different stages of the model to work on different microbatches simultaneously.

The `num_pipeline_microbatches` parameter in `configs/base.yml` configures how many of these smaller chunks the global batch is divided into. It must be a multiple of the total number of pipeline stages (`ici_pipeline_parallelism` x `dcn_pipeline_parallelism`).

The choice of `num_pipeline_microbatches` is a trade-off between pipeline idle time and computational efficiency within each stage. More microbatches reduce the pipeline bubble but lead to smaller matrix multiplications within each stage. Very small operations may not fully saturate the hardware's compute units, potentially lowering arithmetic intensity.
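For example, a quick sanity check of this constraint (values are illustrative):

```python
ici_pipeline_parallelism = 4
dcn_pipeline_parallelism = 2
num_pipeline_stages = ici_pipeline_parallelism * dcn_pipeline_parallelism  # 8

num_pipeline_microbatches = 16
assert num_pipeline_microbatches % num_pipeline_stages == 0

global_batch_size_to_train_on = 2048
microbatch_size = global_batch_size_to_train_on // num_pipeline_microbatches  # 128
```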
## Batch Size Ramp-up

MaxText supports gradually increasing the batch size during the initial phase of training to improve stability, a technique also used in frameworks like [NVIDIA's NeMo Megatron](https://docs.nvidia.com/nemo-framework/user-guide/24.09/nemotoolkit/nlp/nemo_megatron/rampup_batch_size.html). This can be configured in `configs/base.yml`:

- `enable_rampup_batch_size`: Set to `True` to activate the ramp-up process.
- `per_device_batch_size_start`: The minimum batch size to start training on.
- `per_device_batch_size`: The target batch size to stabilize on at the end of the ramp-up process.
- `per_device_batch_size_increment`: How much the batch size increases at each ramp-up stage.
- `global_rampup_samples`: The total number of samples to process across all ramp-up stages.

The ramp-up is based on the number of samples processed, not the number of training steps. Each stage processes an equal number of samples before the batch size is increased.
The number of stages is determined by:

`num_increments = (per_device_batch_size - per_device_batch_size_start) / per_device_batch_size_increment`

The total number of ramp-up samples (`global_rampup_samples`) is then distributed equally across these stages, so the number of samples processed in each stage is:

`samples_per_increment = global_rampup_samples / num_increments`

During training, the model processes `samples_per_increment` samples at the current batch size. Once this threshold is reached, the batch size is increased by `per_device_batch_size_increment`, until the target `per_device_batch_size` is reached. This entire process is managed by the `RampupBatchManager` class.
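A worked example of this schedule with illustrative values (the real bookkeeping lives in `RampupBatchManager`):

```python
per_device_batch_size_start = 2
per_device_batch_size = 8            # target
per_device_batch_size_increment = 2
global_rampup_samples = 600_000

num_increments = (
    per_device_batch_size - per_device_batch_size_start
) // per_device_batch_size_increment                             # (8 - 2) / 2 = 3
samples_per_increment = global_rampup_samples // num_increments  # 200_000

batch_size = per_device_batch_size_start
for stage in range(num_increments):
  print(f"stage {stage}: per-device batch size {batch_size} "
        f"for {samples_per_increment} samples")
  batch_size += per_device_batch_size_increment
# Prints stages at batch sizes 2, 4, 6; training then continues at the target, 8.
```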
## Reinforcement Learning (RL) Batch Size

The batch size parameters for RL training are defined in `configs/post_train/rl.yml`:

- `batch_size` is the number of unique prompts loaded from the dataset in a single batch. For instance, `batch_size=1` means one prompt is processed at a time by the data loader.

- `num_generations` is the number of responses the policy generates for each prompt within a single training step.

- The effective training batch is the total number of prompt-response pairs used in a training step, calculated as `batch_size x num_generations` (see the sketch after this list).

- `micro_batch_size` splits the batch of prompt-response pairs into smaller chunks for memory management. This enables overlapping the rollout phase (generating responses) of one micro-batch with the training phase (updating model weights) of the previous micro-batch, which can improve hardware utilization. A value of `-1` disables micro-batching.
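Putting these together (illustrative values):

```python
batch_size = 4         # unique prompts per step
num_generations = 8    # responses sampled per prompt
effective_batch = batch_size * num_generations  # 32 prompt-response pairs

micro_batch_size = 8   # -1 would disable micro-batching
num_micro_batches = effective_batch // micro_batch_size  # 4 chunks
# Rollout for chunk i can overlap with the training step on chunk i - 1.
```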

src/maxtext/configs/base.yml

Lines changed: 1 addition & 0 deletions
@@ -779,6 +779,7 @@ adam_b2: 0.95 # Exponential decay rate to track the second moment of past gradie
 adam_eps: 1.e-8 # A small constant applied to denominator outside of the square root.
 adam_eps_root: 0. # A small constant applied to denominator inside the square root.
 adam_weight_decay: 0.1 # AdamW Weight decay
+adamw_mask: [] # List of parameter names/patterns to exclude from weight decay in AdamW, like ['bias', '.*norm', '.*ln.*'].
 mu_dtype: "" # data type to store "mu" of AdamW tracking the first moment. Inherits from weight_dtype if unset.
 # Setting nu_dtype is not yet supported by optax, instead nu_dtype is always inherited from weights.
 # See b/399961932 for more.
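The diff does not show how this mask is consumed. As a hedged sketch, a name-pattern list like `adamw_mask` could be turned into the boolean mask that `optax.adamw` accepts for weight decay; the `build_decay_mask` helper below is hypothetical, not MaxText code:

```python
import re

import jax
import jax.numpy as jnp
import optax

def build_decay_mask(params, patterns):
  """True where weight decay applies; False where a pattern matches the name."""
  def keep_decay(path, _):
    name = jax.tree_util.keystr(path)  # e.g. "['norm']['scale']"
    return not any(re.search(p, name) for p in patterns)
  return jax.tree_util.tree_map_with_path(keep_decay, params)

params = {"dense": {"kernel": jnp.ones((4, 4)), "bias": jnp.zeros(4)},
          "norm": {"scale": jnp.ones(4)}}
mask = build_decay_mask(params, ["bias", ".*norm"])
tx = optax.adamw(1e-3, weight_decay=0.1, mask=mask)  # decay skips masked leaves
```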

src/maxtext/configs/pyconfig.py

Lines changed: 4 additions & 1 deletion
@@ -135,7 +135,10 @@ def _prepare_for_pydantic(raw_keys: dict[str, Any]) -> dict[str, Any]:
       new_value = [new_value]

     # An empty value provided in the configuration is treated as None
-    if key in ("hf_train_files", "hf_eval_files") and new_value == "":
+    if (
+        key in ("hf_train_files", "hf_eval_files", "hf_access_token", "hf_name", "hf_data_dir", "hf_eval_split")
+        and new_value == ""
+    ):
       new_value = None

     if key == "run_name" and new_value is None:
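A hedged, standalone sketch of the normalization this hunk implements (not the actual `_prepare_for_pydantic`): empty-string values for these HF keys become `None`, so the now-optional pydantic fields validate cleanly:

```python
_EMPTY_AS_NONE = ("hf_train_files", "hf_eval_files", "hf_access_token",
                  "hf_name", "hf_data_dir", "hf_eval_split")

def normalize(raw: dict) -> dict:
  # Map "" to None only for the keys that were made Optional.
  return {k: (None if k in _EMPTY_AS_NONE and v == "" else v)
          for k, v in raw.items()}

assert normalize({"hf_name": "", "hf_path": ""}) == {"hf_name": None, "hf_path": ""}
```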

src/maxtext/configs/types.py

Lines changed: 11 additions & 5 deletions
@@ -994,11 +994,11 @@ class HfDataset(BaseModel):
   """Configuration specific to HuggingFace datasets."""

   hf_path: str = Field("", description="Path of the Hugging Face dataset.")
-  hf_name: str = Field("", description="Name of the Hugging Face dataset.")
-  hf_data_dir: PathStr = Field("", description="Data directory for the HF dataset.")
-  hf_train_files: Optional[str] = Field(None, description="Files for the HF training split.")
-  hf_eval_split: str = Field("", description="Name of the HF evaluation split.")
-  hf_eval_files: Optional[str] = Field(None, description="Files for the HF evaluation split.")
+  hf_name: None | str = Field(None, description="Name of the Hugging Face dataset.")
+  hf_data_dir: None | PathStr = Field(None, description="Data directory for the HF dataset.")
+  hf_train_files: None | str = Field(None, description="Files for the HF training split.")
+  hf_eval_split: None | str = Field(None, description="Name of the HF evaluation split.")
+  hf_eval_files: None | str = Field(None, description="Files for the HF evaluation split.")
   hf_access_token: None | str = Field(None, description="Hugging Face API access token.")


@@ -1175,6 +1175,12 @@ class AdamW(BaseModel):
       description="A small constant for numerical stability (epsilon), applied inside of the square root.",
   )
   adam_weight_decay: float = Field(0.1, description="Weight decay regularization.")
+  adamw_mask: list[str] = Field(
+      default_factory=list,
+      description=(
+          "List of parameter names/patterns to exclude from weight decay in AdamW, like ['bias', '.*norm', '.*ln.*']"
+      ),
+  )
   mu_dtype: str = Field(
       "",
       description="Data type for 'mu' (first moment) in AdamW. Inherits from weight_dtype if empty.",

src/maxtext/input_pipeline/input_pipeline_utils.py

Lines changed: 8 additions & 6 deletions
@@ -187,7 +187,8 @@ def _get_completion_in_chat_template(tokenizer_model, round_msgs):
     A string representing the completion formatted by the chat template.
   """
   prompt_completion_tokens = tokenizer_model.apply_chat_template(round_msgs, add_generation_prompt=False, tokenize=True)
-  prompt_tokens = tokenizer_model.apply_chat_template(round_msgs[:-1], add_generation_prompt=False, tokenize=True)
+  # include generation_prompt as part of the prompt tokens
+  prompt_tokens = tokenizer_model.apply_chat_template(round_msgs[:-1], add_generation_prompt=True, tokenize=True)

   # attention masks in BatchEncoding are effectively ignored
   if hasattr(prompt_completion_tokens, INPUT_TOKENS_KEY):
@@ -209,7 +210,7 @@ def _get_completion_in_chat_template(tokenizer_model, round_msgs):

 def apply_chat_template(example, tokenizer_model, data_column_name):
   """Formats conversational data by applying the tokenizer's chat template
-  and identifying prompt/completion segments.
+  and identifying prompt/completion segments for SFT masking.

   Args:
     example: A dictionary containing conversational data. It is expected to have a key
@@ -223,9 +224,10 @@ def apply_chat_template(example, tokenizer_model, data_column_name):
     The modified `example` dictionary.
     - The `data_column_name` column will be updated to a list of
       messages, each formatted according to the tokenizer's chat template.
-    - A new column named "is_prompt" will be added, where `True`
-      indicates a system message or a user message (prompt) and `False` indicates an assistant
-      message (completion).
+    - A new column "is_prompt" is added, where `True` indicates the
+      tokens contain the system message, user message, and generation
+      prompt (if applicable). `False` indicates the expected LLM
+      completion, excluding the assistant's start tokens.
   """
   messages = []
   is_prompt = []
@@ -239,7 +241,7 @@
     elif message["role"] == "user":
       round_msgs.append(message)
       prompt_in_chat_template = tokenizer_model.apply_chat_template(
-          round_msgs, add_generation_prompt=False, tokenize=False
+          round_msgs, add_generation_prompt=True, tokenize=False
       )
       messages.append(prompt_in_chat_template)
       is_prompt.append(True)
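To see why including the generation prompt matters, here is a hedged illustration using Hugging Face's `apply_chat_template` directly; the model name and messages are examples, not what MaxText uses:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
msgs = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"}]

full = tok.apply_chat_template(msgs, add_generation_prompt=False, tokenize=True)
# With add_generation_prompt=True the prompt ends with the assistant's start
# tokens (e.g. "<|im_start|>assistant\n"), so the completion below is exactly
# the assistant's reply and its end-of-turn tokens, with no header tokens.
prompt = tok.apply_chat_template(msgs[:-1], add_generation_prompt=True, tokenize=True)
completion = full[len(prompt):]
```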