Support of gdn kernel from tpu-inference by khatwanimohit · Pull Request #4051 · AI-Hypercomputer/maxtext

khatwanimohit · 2026-06-03T17:29:22Z

Description

This PR is inspired by @NicoGrande's PR "Nicogrande/add gdn support" (#3917). (Thank you @NicoGrande, we miss you! 😊)
GDN Integration and Sharding Support: Integrated Gated Delta Net (GDN) logic from nicogrande/add-gdn-support.
GDN Expert Replication: Added an experimental self._gdn_replicate_expert option (triggered via the MAXTEXT_GDN_REPLICATE_EXPERT environment variable) to control whether attn_head is set to ShardingAxisName.MODEL or ShardingAxisName.ATTN_HEAD.
Unit Test Fixes: Resolved initialization errors in attention_test.py by ensuring self.mesh is properly initialized.
Profiling: Added support for profiling in vllm_decode.py.
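
For readers unfamiliar with this kind of toggle, here is a minimal sketch of how an environment variable could select between the two sharding axes. It is not the code from this PR: the `ShardingAxisName` enum below is a local stand-in with placeholder values, the helper names `gdn_replicate_expert` and `attn_head_axis` are hypothetical, and the direction of the mapping (flag on selects MODEL) is an assumption, since the description does not say which value maps to which axis.

```python
import os
from enum import Enum


class ShardingAxisName(str, Enum):
    # Local stand-in for the real ShardingAxisName; string values are placeholders.
    MODEL = "model"
    ATTN_HEAD = "attn_head"


def gdn_replicate_expert(env=None) -> bool:
    """Reads MAXTEXT_GDN_REPLICATE_EXPERT as a boolean flag (default: off)."""
    env = os.environ if env is None else env
    raw = env.get("MAXTEXT_GDN_REPLICATE_EXPERT", "")
    return raw.strip().lower() in {"1", "true", "yes"}


def attn_head_axis(replicate_expert: bool) -> ShardingAxisName:
    # ASSUMPTION: the flag-to-axis direction is a guess for illustration only.
    if replicate_expert:
        return ShardingAxisName.MODEL
    return ShardingAxisName.ATTN_HEAD


if __name__ == "__main__":
    print(attn_head_axis(gdn_replicate_expert()))
```

Passing the environment as an argument keeps the flag parsing testable without mutating `os.environ`.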

You can also provide a comma-separated list. If you don't want to close a bug but
simply to reference it, use BUGS, e.g.:
BUGS: b/517158881

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

NEW_MODEL_DESIGN=1 \
    PYTHONPATH=/home/mohitkhatwani_google_com/workspace/maxtext/src \
    MAXTEXT_GDN_REPLICATE_EXPERT=true \
    /home/mohitkhatwani_google_com/workspace/max_venv/bin/python3 \
      -m maxtext.inference.vllm_decode src/maxtext/configs/base.yml \
      base_output_directory=gs://runner-maxtext-logs \
      run_name=mohit-qwen3.5-maxtext-bench-$RANDOM \
      model_name=qwen3.5-35b-a3b \
      tokenizer_path=Qwen/Qwen3.5-35B-A3B \
      vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
      load_parameters_path=gs://maxtext-model-checkpoints/qwen3.5-35b-a3b/unscanned/0/items \
      ici_tensor_parallelism=4 \
      hbm_utilization_vllm=0.5 \
      prompt="Tell me three fun facts about Buenos Aires." \
      decode_sampling_temperature=0.0 decode_sampling_nucleus_p=1.0 decode_sampling_top_k=0.0 \
      pure_nnx_decoder=True use_chat_template=True max_num_seqs=1 \
      enable_dp_attention=False prefuse_moe_weights=True max_target_length=64 debug_sharding=False \
      scan_layers=false profiler=xplane

Profile: https://xprof.corp.google.com/trace_viewer/mohitkhatwani-11546393038070241787?view_start=71184.720&view_end=71204.219

Logs: https://paste.googleplex.com/5464600022745088

Decode performance: 25ms

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-03T17:33:49Z

Codecov Report

❌ Patch coverage is 17.24138% with 144 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...t/integration/vllm/maxtext_vllm_adapter/adapter.py | 0.00% | 104 Missing ⚠️ |
| src/maxtext/models/qwen3.py | 42.85% | 34 Missing and 2 partials ⚠️ |
| .../integration/vllm/maxtext_vllm_adapter/__init__.py | 0.00% | 2 Missing ⚠️ |
| src/maxtext/layers/decoders.py | 0.00% | 1 Missing and 1 partial ⚠️ |
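
As a sanity check on the headline figure, the short sketch below recovers the patch size from the numbers in the report. It assumes that partially covered lines count as not covered, which is consistent with the 144 total in the headline; the dictionary keys are shortened file names used only for illustration.

```python
# Figures copied from the Codecov table: (missing, partial) lines per file.
not_covered = {
    "adapter.py": (104, 0),
    "qwen3.py": (34, 2),
    "__init__.py": (2, 0),
    "decoders.py": (1, 1),
}
patch_coverage = 17.24138  # percent, from the report headline

# Assumption: partials are counted together with missing lines.
total_uncovered = sum(missing + partial for missing, partial in not_covered.values())

# coverage = covered / total and total = covered + uncovered,
# so total = uncovered / (1 - coverage).
total_lines = round(total_uncovered / (1 - patch_coverage / 100))
covered_lines = total_lines - total_uncovered

print(total_uncovered, total_lines, covered_lines)  # 144 174 30
```

Under that assumption the patch touches 174 measured lines, of which 30 are covered.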

github-actions · 2026-06-03T18:40:14Z

🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request introduces support for the Gated Delta Net (GDN) kernel from tpu-inference into MaxText, specifically for vLLM decoding. It includes integration of the GDN logic, sharding support, profiling capabilities, and necessary monkey-patches for hybrid Attention+GDN models.

🔍 General Feedback

Configurability: The use of environment variables like MAXTEXT_GDN_REPLICATE_EXPERT for model behavior should be transitioned to the formal Config system for better reproducibility.
Robustness: Global monkey-patching of library classes (e.g., ModelConfig.uses_mrope and KVCacheManager) is quite brittle and may lead to issues with future updates or different model types.
Performance: The dynamic padding logic in Qwen3NextGatedDeltaNet could trigger frequent JAX re-compilations if batch sizes vary, which should be addressed for production workloads.
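
The recompilation concern arises because a jitted JAX function is retraced and recompiled for each new input shape. One common mitigation, shown here as a minimal sketch and not as what Qwen3NextGatedDeltaNet does, is to round sizes up to a small set of buckets so that many distinct lengths share one compiled shape. The helper names `next_bucket` and `pad_to_bucket` and the default floor of 8 are hypothetical.

```python
def next_bucket(n: int, min_bucket: int = 8) -> int:
    """Rounds n up by doubling from min_bucket until the bucket fits n."""
    if n < 1:
        raise ValueError("n must be positive")
    bucket = min_bucket
    while bucket < n:
        bucket *= 2
    return bucket


def pad_to_bucket(tokens: list[int], pad_id: int = 0) -> list[int]:
    """Pads tokens on the right so the length lands on a bucket boundary."""
    target = next_bucket(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))


if __name__ == "__main__":
    # Lengths 1..64 collapse into four padded shapes: 8, 16, 32, 64.
    shapes = sorted({next_bucket(n) for n in range(1, 65)})
    print(shapes)
```

The trade-off is wasted compute on padding in exchange for a bounded number of compilations.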

github-actions · 2026-06-03T19:08:00Z

🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

This Pull Request introduces comprehensive support for Gated Delta Net (GDN) kernels from tpu-inference into MaxText, including integration with vLLM and necessary sharding updates. The changes are logically sound and significantly improve the flexibility of the Qwen3/Qwen3.5 model implementations.

🔍 General Feedback

Consistency: The simplification of kv_cache handling in decoders.py is a great improvement, bringing Qwen3 into alignment with other models in the codebase.
Complexity: The monkey-patching and adapter logic are necessary evils for this level of integration, but should be monitored closely for breakages when upstream vLLM or tpu-inference APIs evolve.
Performance: The use of shard_map and specialized kernels in qwen3.py demonstrates a high level of optimization for TPU sharding.
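
One way to make monkey-patches of the kind discussed above easier to monitor is to apply them through a guard that fails loudly when the target attribute no longer exists and restores the original afterwards. The sketch below is illustrative only: `patched_attr` is a hypothetical helper, and `FakeModelConfig` is a stand-in class, not the real vLLM `ModelConfig`.

```python
from contextlib import contextmanager


@contextmanager
def patched_attr(owner, name, replacement):
    """Temporarily replaces owner.name and restores the original on exit."""
    if not hasattr(owner, name):
        # Surfaces upstream renames or removals instead of silently adding an attribute.
        raise AttributeError(
            f"{owner!r} has no attribute {name!r}; the upstream API may have changed"
        )
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        setattr(owner, name, original)


class FakeModelConfig:
    # Stand-in for a library class, used only to demonstrate the guard.
    uses_mrope = True


if __name__ == "__main__":
    with patched_attr(FakeModelConfig, "uses_mrope", False):
        print(FakeModelConfig.uses_mrope)  # False inside the block
    print(FakeModelConfig.uses_mrope)  # True again after the block
```

A process-wide patch applied at import time cannot use the scoped form, but the existence check alone still turns a silent behavior change into an immediate error.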

github-actions · 2026-06-03T19:12:12Z

🤖 I'm sorry @khatwanimohit, but I was unable to process your request. Please see the logs for more details.

support of gdn kernel from tpu-inference

e1aa0bb

khatwanimohit added the gemini-review label Jun 3, 2026

github-actions Bot reviewed Jun 3, 2026

khatwanimohit added gemini-review and removed gemini-review labels Jun 3, 2026

github-actions Bot reviewed Jun 3, 2026

khatwanimohit wants to merge 1 commit into main from mohit/gdn_support
