Merge branch 'main' of github.com:AI-Hypercomputer/maxtext into shuningjin-fix

shuningjin · shuningjin · commit 8bcb4c188105 · 2026-04-25T22:55:43.000Z
diff --git a/docs/tutorials/post_training_index.md b/docs/tutorials/post_training_index.md
@@ -70,4 +70,5 @@ posttraining/rl_on_multi_host.md
 posttraining/knowledge_distillation.md
 posttraining/multimodal.md
 posttraining/full_finetuning.md
+posttraining/gepa_optimization.md
 ```
diff --git a/docs/tutorials/posttraining/gepa_optimization.md b/docs/tutorials/posttraining/gepa_optimization.md
@@ -0,0 +1,78 @@
+# GEPA Prompt Optimization for MaxText
+
+## Overview
+
+This document explains how to use **GEPA** (Genetic-Pareto) to optimize system prompts for MaxText models. GEPA is an evolutionary framework ([GitHub Repository](https://github.com/gepa-ai/gepa), [Paper](https://arxiv.org/abs/2507.19457)) that iteratively refines prompts based on evaluation feedback, helping models perform better on specific tasks. A complete, runnable example notebook is provided in the repository at [maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb).
+
+## How GEPA Optimization Works
+
+The optimization process relies on a collaborative loop between two Language Models (LMs):
+
+1. **Target Model**: This is the model being optimized. It attempts to solve the evaluation problems (e.g., AIME questions) using the current candidate system prompt. For example, this can be a `Qwen3-4B` model hosted on a local vLLM server.
+2. **Reflection LM**: This model reviews the reasoning traces and failures of the Target Model. It identifies recurring errors (e.g., mathematical errors or formatting issues) and proposes targeted updates to the system prompt. For example, a model like `Gemini 3 Flash Preview` can be used as the reflection model.
+
+### The Evolutionary Loop
+
+1. **Propose**: The Reflection LM proposes a new system prompt based on errors seen in previous runs.
+2. **Evaluate (Subsample)**: The Target Model solves a small random subset of problems using the new prompt. This serves as a quick screening step.
+3. **Full Evaluation**: If the subsample score improves, the prompt is evaluated on the full validation set.
+4. **Selection**: Successful prompts are added to the candidate pool, driving the evolution of domain-specific heuristics (such as circle packing formulas or prime factorization strategies) that eventually form the final optimized prompt.
+
+### Synergy via Prompt Merging
+
+A key feature used during the AIME experimentation was **Prompt Merging** (`use_merge=True`).
+
+As the evolutionary process runs, different branches might discover distinct, valid heuristics (e.g., one branch learns a rule for Geometry, while another learns a rule for Combinatorics).
+
+- **How It Works**: Instead of forcing a choice between these two distinct winning paths, GEPA attempts to merge them. The Reflection LM is instructed to synthesize the instructions from both candidates, deduplicating content and integrating the new knowledge into a single, unified system prompt.
+- **Why It Is Important**: Merging allows the optimization to achieve synergetic gains. By combining orthogonal prompt improvements, the final system prompt acts as a comprehensive "cheat sheet" covering multiple mathematical domains simultaneously, which is critical for the broad range of problems found in datasets like AIME.
+
+## Robust Evaluation with MathAdapter
+
+A critical component of the optimization setup is the custom `MathAdapter`.
+
+### Why the Custom Logic?
+
+Standard evaluation pipelines often use simple regular expressions to extract the answer from a model's response (e.g., capturing everything inside `\boxed{}`). However, competition math problems like AIME frequently require answers formatted in complex LaTeX (such as fractions `\boxed{\frac{a}{b}}` or nested expressions). A naive regex will break on the first closing brace `}`, failing to capture the full answer.
+
+The `MathAdapter` implements a robust **brace-counting parser** that correctly tracks nested LaTeX structures, ensuring the complete mathematical expression is extracted.
+
+### Why It Is Crucial for GEPA
+
+Prompt optimization frameworks like GEPA are highly sensitive to the reward signal (the evaluation score). If a model generates a correct answer but the evaluation logic fails to parse it correctly (a False Negative), the optimization loop receives faulty feedback. This noisy signal can cause GEPA to discard beneficial prompt mutations, ultimately leading to performance degradation instead of improvement.
+
+## Tutorial Notebook
+
+A complete, runnable tutorial is available in the repository as a Jupyter Notebook:
+[maxtext_with_gepa.ipynb](../../../src/maxtext/examples/maxtext_with_gepa.ipynb) (provided as an example)
+
+This notebook walks through:
+
+- Streaming the dataset.
+- Setting up a custom `MathAdapter` for float extraction.
+- Running the GEPA evolutionary loop.
+- Comparing accuracy before and after optimization.
+
+> [!NOTE]
+> In this tutorial, we utilize an out-of-tree version of vLLM tailored for MaxText models via the `maxtext_vllm_adapter`. For more information on serving MaxText models with vLLM, refer to the [Inference Guide](../inference.md).
+
+## Pointing GEPA to the Local vLLM Server
+
+By default, optimization frameworks might expect to communicate with remote model APIs. In our setup, we route the evaluation traffic to the locally running MaxText model on the vLLM server by overriding the API base URL.
+
+This is achieved by setting the following environment variables in the script:
+
+```python
+os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"
+os.environ["OPENAI_API_KEY"] = "fake-key"
+```
+
+When the `MathAdapter` initializes the model (e.g., specifying `openai/Qwen/Qwen3-4B-Instruct-2507`), `litellm` (used by GEPA under the hood) intercepts the request and directs it to the local server running on the TPU host instead of attempting to connect to a remote OpenAI endpoint.
+
+## Case Study: AIME Prompt Optimization
+
+In our experiments with the **AIME (American Invitational Mathematics Examination)** dataset, we utilized **Qwen3-4B** as the Target Model (hosted locally via vLLM) and **Gemini 3 Flash Preview** as the Reflection LM.
+
+With this setup, GEPA improved the model's accuracy from **49.0% to 54.0%** (an absolute improvement of 5 percentage points).
+
+The optimization process discovered that injecting specific domain knowledge and heuristics (like circle packing formulas and square-free parts for number theory) significantly helped the model solve complex competition-level problems.
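
The brace-counting extraction described in the tutorial's `MathAdapter` section can be illustrated with a minimal sketch. The function names `extract_boxed` and `naive_extract` are hypothetical and are not taken from the notebook; the sketch also assumes braces are never escaped (`\{`, `\}`) inside the answer.

```python
import re
from typing import Optional


def naive_extract(text: str) -> Optional[str]:
  """Regex approach: stops at the first closing brace, so nested LaTeX is cut off."""
  match = re.search(r"\\boxed\{(.*?)\}", text)
  return match.group(1) if match else None


def extract_boxed(text: str) -> Optional[str]:
  """Return the contents of the last \\boxed{...} in text, tracking brace depth."""
  marker = "\\boxed{"
  start = text.rfind(marker)
  if start == -1:
    return None
  content_start = start + len(marker)
  depth = 1  # the opening brace of \boxed{ is already consumed
  for i in range(content_start, len(text)):
    if text[i] == "{":
      depth += 1
    elif text[i] == "}":
      depth -= 1
      if depth == 0:
        return text[content_start:i]
  return None  # unbalanced braces: treat as "no answer found"


response = r"Therefore the answer is \boxed{\frac{a}{b}}."
print(naive_extract(response))  # truncated at the first closing brace
print(extract_boxed(response))  # full nested expression
```

Returning `None` on unbalanced braces keeps malformed outputs from being scored as partial matches, which matters because a false negative or false positive feeds directly into the reward signal.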
diff --git a/src/dependencies/requirements/generated_requirements/tpu-post-train-requirements.txt b/src/dependencies/requirements/generated_requirements/tpu-post-train-requirements.txt
@@ -81,6 +81,7 @@ frozenlist>=1.8.0
 fsspec>=2026.1.0
 gast>=0.6.0
 gcsfs>=2026.1.0
+gepa>=0.1.1
 gguf>=0.17.1
 google-api-core>=2.28.1
 google-api-python-client>=2.187.0
diff --git a/src/maxtext/configs/post_train/rl.yml b/src/maxtext/configs/post_train/rl.yml
@@ -68,6 +68,9 @@ rl:
   degenerate_group_masking: True
   # Upper-bound clipping epsilon for GRPO loss; defaults to grpo_epsilon when null.
   epsilon_high: null
+  # Number of model keys per chunk when resharding tensors between trainer and rollout devices.
+  # If null, the entire model is resharded at once.
+  reshard_chunk_size: null
 
 
 # ====== Models ======
diff --git a/src/maxtext/configs/types.py b/src/maxtext/configs/types.py
@@ -436,26 +436,16 @@ class Quantization(BaseModel):
   )
   weight_sparsity_n: int | None = Field(
       None,
-      description=(
-          "The 'N' in N:M sparsity, representing the maximum number of non-zero"
-          " values in each block."
-      ),
+      description="The 'N' in N:M sparsity, representing the maximum number of non-zero values in each block.",
   )
   weight_sparsity_m: int | None = Field(
       None,
-      description=(
-          "The 'M' in N:M sparsity, representing the number of values in each"
-          " block."
-      ),
-  )
-  weight_sparsity_update_step: int = Field(
-      10, description="The step size for updating weight sparsity masks."
+      description="The 'M' in N:M sparsity, representing the number of values in each block.",
   )
+  weight_sparsity_update_step: int = Field(10, description="The step size for updating weight sparsity masks.")
   weight_sparsity_start_step: int = Field(
       50,
-      description=(
-          "The first number of steps before updating the sparsity masks."
-      ),
+      description="The number of initial steps to run before the sparsity masks are first updated.",
   )
 
 
@@ -1822,6 +1812,13 @@ class RL(BaseModel):
   epsilon_high: Optional[float] = Field(
       None, description="Upper-bound clipping epsilon for GRPO loss. Defaults to epsilon when None (agentic only)."
   )
+  reshard_chunk_size: Optional[int] = Field(
+      None,
+      description=(
+          "Number of model keys per chunk when resharding tensors between trainer and rollout devices. "
+          "If None, no chunking is applied, which may lead to OOM errors if tensors are too large."
+      ),
+  )
 
 
 class RLDataset(BaseModel):
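
To make the intended semantics of `reshard_chunk_size` concrete, here is a rough sketch of key-based chunking. The helper `iter_key_chunks` and the toy `params` dict are hypothetical; the actual resharding is handled by the rollout backend that receives `rollout_vllm_reshard_chunk_size` and is not shown in this diff.

```python
from typing import Dict, Iterator, Optional, TypeVar

T = TypeVar("T")


def iter_key_chunks(params: Dict[str, T], chunk_size: Optional[int]) -> Iterator[Dict[str, T]]:
  """Yield sub-dicts of at most chunk_size keys; None yields the whole dict at once."""
  if chunk_size is None:
    yield dict(params)
    return
  if chunk_size <= 0:
    raise ValueError("chunk_size must be a positive integer or None")
  keys = list(params)
  for start in range(0, len(keys), chunk_size):
    yield {k: params[k] for k in keys[start : start + chunk_size]}


# Stand-in values; in practice each value would be a (possibly large) tensor.
params = {"embed": 0, "layer_0": 1, "layer_1": 2, "head": 3, "norm": 4}
for chunk in iter_key_chunks(params, chunk_size=2):
  print(list(chunk))  # each chunk would be resharded before the next is touched
```

Processing a few keys at a time bounds the peak memory needed for transfer buffers, at the cost of more transfer rounds.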
diff --git a/src/maxtext/models/gemma4.py b/src/maxtext/models/gemma4.py
@@ -370,7 +370,7 @@ def __call__(
 
     next_layer_addition = mlp_lnx + residual
     layer_output = next_layer_addition
-    layer_output = layer_output * self.layer_scalar.value
+    layer_output = layer_output * jnp.asarray(self.layer_scalar.value, cfg.dtype)
 
     layer_output = nn.with_logical_constraint(layer_output, self.activation_axis_names)
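
The cast in the changed line presumably guards against dtype promotion. The following sketch assumes the activations are `bfloat16` and the scalar parameter is stored in `float32`; both dtypes are assumptions for illustration, not confirmed by this diff.

```python
import jax.numpy as jnp

activations = jnp.ones((2, 4), dtype=jnp.bfloat16)
layer_scalar = jnp.ones((1,), dtype=jnp.float32)  # assumed parameter dtype

# Multiplying arrays of different float dtypes promotes to the wider one.
promoted = activations * layer_scalar

# Casting the scalar first keeps the result in the activation dtype.
preserved = activations * jnp.asarray(layer_scalar, jnp.bfloat16)

print(promoted.dtype)
print(preserved.dtype)
```

Under these assumptions the uncast product is silently upcast, which can change memory use and the dtypes seen by downstream layers.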
 
diff --git a/src/maxtext/trainers/post_train/rl/train_rl.py b/src/maxtext/trainers/post_train/rl/train_rl.py
@@ -405,6 +405,7 @@ def create_rl_components(
           rollout_vllm_max_num_seqs=trainer_config.max_num_seqs,
           rollout_vllm_async_scheduling=trainer_config.async_scheduling,
           rollout_vllm_server_mode=trainer_config.rl.use_agentic_rollout,
+          rollout_vllm_reshard_chunk_size=trainer_config.rl.reshard_chunk_size,
           rollout_vllm_kwargs={
               "hf_overrides": trainer_config.vllm_hf_overrides,
               "enable_expert_parallel": sampler_config.enable_expert_parallel,
diff --git a/src/maxtext/trainers/pre_train/train.py b/src/maxtext/trainers/pre_train/train.py
@@ -387,31 +387,6 @@ def train_step(model, config, state_mesh_shardings, params_shardings, state, dat
   else:
     grads = raw_grads
 
-  # fp8 fix: sanitize NaN OWG (overwrite-with-gradient) stats before apply_gradients.
-  # Under FSDP, the fp8 output gradient amax can be NaN at step 0, which propagates into
-  # amax_history and corrupts future steps. Replace NaN OWG entries with the current state
-  # values (skip the amax update for that step) instead of letting NaN flow through.
-  # Also restore OWG values after apply_gradients to bypass optimizer corruption
-  # (Adam should not update fp8 scale/amax_history).
-  fp8_stats = dict(grads).get(maxtext_utils.OVERWRITE_WITH_GRADIENT, None)
-  if fp8_stats is not None:
-    if maxtext_utils.OVERWRITE_WITH_GRADIENT in state.params:
-      current_fp8 = state.params[maxtext_utils.OVERWRITE_WITH_GRADIENT]
-      fp8_stats = jax.tree_util.tree_map(
-          lambda new, cur: jnp.where(jnp.isnan(new), cur, new),
-          fp8_stats,
-          current_fp8,
-      )
-    else:
-      fp8_stats = jax.tree_util.tree_map(lambda x: jnp.nan_to_num(x, nan=0.0), fp8_stats)
-    grads = dict(grads)
-    grads[maxtext_utils.OVERWRITE_WITH_GRADIENT] = fp8_stats
-  # Zero out any remaining NaN in float gradients to prevent param corruption
-  grads = jax.tree_util.tree_map(
-      lambda x: jnp.nan_to_num(x, nan=0.0) if jnp.issubdtype(x.dtype, jnp.floating) else x,
-      grads,
-  )
-
   if config.optimizer_memory_host_offload:
     state = state.replace(
         opt_state=jax.device_put(
@@ -462,25 +437,7 @@ def move(path, value):
     )
   else:
     new_state = state.apply_gradients(grads=full_grads)
-  # fp8 fix: restore sanitized OWG values, bypassing any optimizer update to fp8 stats.
-  if fp8_stats is not None:
-    new_params = dict(new_state.params)
-    new_params[maxtext_utils.OVERWRITE_WITH_GRADIENT] = fp8_stats
-    new_state = new_state.replace(params=new_params)
-  has_batch_stats = (
-      config.weight_sparsity_n
-      and config.weight_sparsity_m
-      and bool(aux.get("batch_stats"))
-      and isinstance(state.params, dict)
-      and "batch_stats" in state.params
-  )
 
-  if has_batch_stats:
-    new_params = dict(new_state.params)
-    new_params["batch_stats"] = max_utils.unbox_logicallypartioned(
-        aux["batch_stats"]
-    )
-    new_state = new_state.replace(params=new_params)
   # Apply updates for Auxiliary-Loss-Free load balancing for DeepSeek family
   if config.routed_bias and config.routed_bias_update_rate > 0.0 and moe_bias_updates is not None:
     target_path = ("params", "decoder", "moe_layers", "DeepSeekMoeBlock_0", "MoeBlock_0", "gate", "bias")
diff --git a/tests/sparsity_test.py b/tests/sparsity_test.py
@@ -18,6 +18,7 @@
 import tempfile
 from absl.testing import absltest
 from absl.testing import parameterized
+import pytest
 from maxtext.trainers.pre_train import train
 from tests.utils.test_helpers import get_test_config_path
 
@@ -45,9 +46,8 @@ class Train(parameterized.TestCase):
           "use_sparsity": True,
       },
   )
-  def test_different_quant_sparsity_configs(
-      self, quantization: str, use_sparsity: bool
-  ):
+  @pytest.mark.tpu_only
+  def test_different_quant_sparsity_configs(self, quantization: str, use_sparsity: bool):
     test_tmpdir = os.environ.get("TEST_TMPDIR", gettempdir())
     outputs_dir = os.environ.get("TEST_UNDECLARED_OUTPUTS_DIR", test_tmpdir)
     args = [
@@ -81,11 +81,13 @@ def test_different_quant_sparsity_configs(
         f"metrics_file={os.path.join(outputs_dir, 'metrics.json')}",
     ]
     if use_sparsity:
-      args.extend([
-          "weight_sparsity_n=2",
-          "weight_sparsity_m=4",
-          "weight_sparsity_update_step=1",
-      ])
+      args.extend(
+          [
+              "weight_sparsity_n=2",
+              "weight_sparsity_m=4",
+              "weight_sparsity_update_step=1",
+          ]
+      )
     train_main(args)
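
For readers unfamiliar with the `weight_sparsity_n=2`, `weight_sparsity_m=4` setting used in the test, the sketch below shows one common way to build an N:M mask: keep the N largest-magnitude values in each block of M. The helper `nm_sparsity_mask` is hypothetical, blocks are assumed to run along the flattened row-major order, and magnitude-based selection is an assumption rather than a description of the MaxText implementation.

```python
import numpy as np


def nm_sparsity_mask(weights: np.ndarray, n: int, m: int) -> np.ndarray:
  """Boolean mask that keeps the n largest-magnitude values in each block of m."""
  if weights.size % m != 0:
    raise ValueError("number of weights must be divisible by m")
  blocks = np.abs(weights).reshape(-1, m)
  # Column indices of the n largest magnitudes within each block.
  top = np.argsort(blocks, axis=1)[:, -n:]
  mask = np.zeros(blocks.shape, dtype=bool)
  np.put_along_axis(mask, top, True, axis=1)
  return mask.reshape(weights.shape)


w = np.array([[0.1, -0.9, 0.3, 0.05], [2.0, -1.0, 0.0, 0.5]])
mask = nm_sparsity_mask(w, n=2, m=4)
print(mask)
print(w * mask)  # at most 2 non-zero values remain in each block of 4
```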