
Add MoE and MLA remat policies #3414

Merged

copybara-service[bot] merged 3 commits into AI-Hypercomputer:main from abhinavgoel95:abgoel/add-moe-mla-remat-policies on Mar 25, 2026

Conversation

abhinavgoel95 (Contributor) commented Mar 13, 2026

  • Added moe_mlpwi, moe_mlpwi_0, moe_mlpwi_1, moe_mlpwo for MoE layers
  • Added query_wa_proj, kv_wa_proj for MLA layers
  • Updated base.yml, types.py, and pyconfig_deprecated.py

Description

This PR adds rematerialization policy support for Mixture of Experts (MoE) and Multi-head Latent Attention (MLA) layer tensors.

Previously, MaxText only supported remat policies for standard dense layer tensors. This prevented fine-grained memory optimization for MoE models (like Mixtral, DeepSeek V3) and models using MLA architecture (like DeepSeek V3).

This change adds six new configurable remat tensors:

  • MoE tensors: moe_mlpwi, moe_mlpwi_0, moe_mlpwi_1, moe_mlpwo
  • MLA tensors: query_wa_proj, kv_wa_proj

Users can now configure these tensors with device, offload, or remat policies in their config files, enabling better memory management for large MoE models (e.g., DeepSeek V3 671B).

Files modified:

  • src/maxtext/configs/base.yml - Added default 'remat' values
  • src/maxtext/configs/types.py - Added Field definitions with descriptions
  • src/maxtext/configs/pyconfig_deprecated.py - Added to validation whitelist

All new tensors default to 'remat', maintaining backward compatibility.

Tests

Tested with DeepSeek V3 671B (41 layers) on 128 GPUs with various remat configurations:

  • Baseline with all tensors set to remat - ✅ Works
  • Custom policies with selective offload and device placement - ✅ Works
  • Verified backward compatibility with Llama models (no regression)

Example config usage:

moe_mlpwi: 'offload'
query_wa_proj: 'device'
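To illustrate the mechanism behind these config knobs, here is a minimal, self-contained sketch (not MaxText code) of how named-tensor remat policies work in JAX: activations are tagged with `checkpoint_name`, and a checkpoint policy decides which tagged values are saved for the backward pass versus rematerialized. The `toy_expert_mlp` function and its shapes are hypothetical; the tensor names `moe_mlpwi` and `moe_mlpwo` are the ones added by this PR.

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name


def toy_expert_mlp(x, wi, wo):
  """Hypothetical stand-in for an MoE expert MLP."""
  h = checkpoint_name(x @ wi, "moe_mlpwi")  # tag the up-projection output
  return checkpoint_name(jax.nn.relu(h) @ wo, "moe_mlpwo")  # tag the down-projection


# Save only activations tagged "moe_mlpwi"; everything else is rematerialized
# during the backward pass.
policy = jax.checkpoint_policies.save_only_these_names("moe_mlpwi")
remat_mlp = jax.checkpoint(toy_expert_mlp, policy=policy)

x = jnp.ones((4, 8))
wi = jnp.ones((8, 16))
wo = jnp.ones((16, 8))
loss_grad = jax.grad(lambda x: remat_mlp(x, wi, wo).sum())(x)
print(loss_grad.shape)
```

Setting a tensor to `'offload'` instead would route it through `save_and_offload_only_these_names`, moving it to host memory rather than keeping it on device.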

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in https://maxtext.readthedocs.io/en/latest/development.html#adding-new-documentation-files.

Comment thread: src/maxtext/layers/moe.py

layer_w0 = jax.lax.psum(layer_w0, "tensor_transpose")
if self.config.mlp_bias:
  layer_w0 = layer_w0 + w0_bias
layer_w0 = adc.checkpoint_name(layer_w0, "mlpwi_0")
Collaborator:

Heads up: this might affect all legacy TPU recipes/performance for MoE models. We should make an announcement after it gets merged. Thanks!

abhinavgoel95 force-pushed the abgoel/add-moe-mla-remat-policies branch from a3779fc to d7fd385 on March 16, 2026 17:52
codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Files with missing lines              Patch %   Lines
src/maxtext/layers/attention_mla.py   50.00%    1 Missing ⚠️


abhinavgoel95 force-pushed the abgoel/add-moe-mla-remat-policies branch from 2fc4407 to 5109d24 on March 24, 2026 18:48
copybara-service[bot] merged commit de51021 into AI-Hypercomputer:main on Mar 25, 2026 (3 checks passed)
NuojCheng (Collaborator) left a comment:

Can we also update

def get_remat_policy(self):
  """Get remat policy"""
  policy = None
  cfg = self.config
  if cfg.remat_policy != "none":
    if cfg.remat_policy in ("minimal_with_context", "minimal_flash"):
      # save all
      if cfg.remat_policy == "minimal_flash":
        max_logging.log("WARNING: 'minimal_flash' will be deprecated soon, please use 'minimal_with_context' instead.")
      policy = self.minimal_policy(with_context=True)
    elif cfg.remat_policy == "minimal":
      # save all except context
      policy = self.minimal_policy()
    elif cfg.remat_policy == "minimal_with_quantization":
      if cfg.scan_layers:
        warnings.warn(
            "Scan layers can introduce overhead to checkpointed values that in some configurations is slower "
            "than not checkpointing at all. If you are using scan layers, benchmark with and without quantization "
            "checkpointing in your workflow to see which is faster. Without scan layers, checkpointing quantizations is "
            "beneficial for performance."
        )
      policy = self.minimal_policy(with_context=False, with_quantization=True)
    elif cfg.remat_policy == "minimal_with_context_and_quantization":
      if cfg.scan_layers:
        warnings.warn(
            "Scan layers can introduce overhead to checkpointed values that in some configurations is slower "
            "than not checkpointing at all. If you are using scan layers, benchmark with and without quantization "
            "checkpointing in your workflow to see which is faster. Without scan layers, checkpointing quantizations is "
            "beneficial for performance."
        )
      policy = self.minimal_policy(with_context=True, with_quantization=True)
    elif cfg.remat_policy == "save_dot_with_context_except_mlp":
      policy = jax.checkpoint_policies.save_only_these_names(
          "query_proj",
          "value_proj",
          "key_proj",
          "qkv_proj",
          "context",
          "out_proj",
      )
    elif cfg.remat_policy == "save_dot_except_mlpwi":
      policy = jax.checkpoint_policies.save_only_these_names(
          "query_proj",
          "value_proj",
          "key_proj",
          "qkv_proj",
          "out_proj",
          "mlpwo",
      )
    elif cfg.remat_policy == "save_dot_except_mlp":
      policy = jax.checkpoint_policies.save_only_these_names(
          "query_proj",
          "value_proj",
          "key_proj",
          "qkv_proj",
          "out_proj",
      )
    elif cfg.remat_policy == "save_qkv_proj":
      policy = jax.checkpoint_policies.save_only_these_names(
          "query_proj",
          "value_proj",
          "key_proj",
          "qkv_proj",
      )
    elif cfg.remat_policy == "qkv_proj_offloaded":
      policy = jax.checkpoint_policies.save_and_offload_only_these_names(
          names_which_can_be_saved=[],
          names_which_can_be_offloaded=["query_proj", "value_proj", "key_proj"],
          offload_src="device",
          offload_dst="pinned_host",
      )
    elif cfg.remat_policy == "minimal_offloaded":
      # offload all except context
      policy = jax.checkpoint_policies.save_and_offload_only_these_names(
          names_which_can_be_saved=[],
          names_which_can_be_offloaded=[
              "query_proj",
              "value_proj",
              "key_proj",
              "qkv_proj",
              "out_proj",
              "mlpwi_0",
              "mlpwi_1",
              "mlpwi",
              "mlpwo",
          ],
          offload_src="device",
          offload_dst="pinned_host",
      )
    elif cfg.remat_policy == "custom":
      policy = jax.checkpoint_policies.save_and_offload_only_these_names(
          names_which_can_be_saved=cfg.tensors_on_device,
          names_which_can_be_offloaded=cfg.tensors_to_offload,
          offload_src="device",
          offload_dst="pinned_host",
      )
    elif cfg.remat_policy == "save_out_proj":
      policy = jax.checkpoint_policies.save_only_these_names(
          "out_proj",
      )
    else:
      assert cfg.remat_policy == "full", "Remat policy needs to be on list of remat policies"
      policy = None
  return policy

?
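One point worth noting about the quoted code: the "custom" branch builds its policy directly from the config lists `tensors_on_device` and `tensors_to_offload`, so the new MoE/MLA tensor names can flow through it without further code changes. A minimal sketch of the policy that branch would construct for the new names (the specific name lists here are illustrative, not part of the merged change):

```python
import jax

# Hypothetical config lists a user might set under remat_policy: 'custom'.
tensors_on_device = ["query_wa_proj", "kv_wa_proj"]  # keep MLA projections on device
tensors_to_offload = ["moe_mlpwi_0", "moe_mlpwi_1", "moe_mlpwo"]  # offload MoE activations

# Builds a checkpoint policy callable; tagged tensors in the first list are
# saved on device, those in the second are offloaded to pinned host memory,
# and everything else is rematerialized.
policy = jax.checkpoint_policies.save_and_offload_only_these_names(
    names_which_can_be_saved=tensors_on_device,
    names_which_can_be_offloaded=tensors_to_offload,
    offload_src="device",
    offload_dst="pinned_host",
)
```

The resulting `policy` is passed to `jax.checkpoint` in the same way as the named branches above.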


4 participants