Support of gdn kernel from tpu-inference#4051

Open
khatwanimohit wants to merge 1 commit into main from mohit/gdn_support

Conversation

@khatwanimohit
Collaborator

@khatwanimohit khatwanimohit commented Jun 3, 2026

Description

  • This PR is inspired by @NicoGrande's PR Nicogrande/add gdn support #3917. (Thank you @NicoGrande, we miss you! 😊)
  • GDN Integration and Sharding Support: Integrated Gated Delta Net (GDN) logic from nicogrande/add-gdn-support.
  • GDN Expert Replication: Added experimental self._gdn_replicate_expert option (triggered via the MAXTEXT_GDN_REPLICATE_EXPERT environment variable) to control whether attn_head is set to ShardingAxisName.MODEL or ShardingAxisName.ATTN_HEAD.
  • Unit Test Fixes: Resolved initialization errors in attention_test.py by ensuring self.mesh is properly initialized.
  • Profiling: Added support for profiling in vllm_decode.py.
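
The expert-replication toggle described above could be read along the following lines. This is a minimal sketch, not the code in this PR: the `ShardingAxisName` enum here is a hypothetical stand-in for the real one, the accepted truthy spellings are an assumption, and the description does not say which flag value selects which axis, so the mapping in `attn_head_axis` is also an assumption.

```python
import os
from enum import Enum


class ShardingAxisName(str, Enum):
  """Hypothetical stand-in for the real sharding-axis names."""
  MODEL = "model"
  ATTN_HEAD = "attn_head"


def gdn_replicate_expert_from_env(env=None) -> bool:
  """Parses MAXTEXT_GDN_REPLICATE_EXPERT from a mapping (defaults to os.environ).

  Assumption: "1", "true" and "yes" (case-insensitive) count as enabled.
  """
  env = os.environ if env is None else env
  raw = env.get("MAXTEXT_GDN_REPLICATE_EXPERT", "false")
  return raw.strip().lower() in ("1", "true", "yes")


def attn_head_axis(replicate_expert: bool) -> ShardingAxisName:
  """Picks the axis for attn_head.

  Assumption: the direction of this mapping is illustrative only.
  """
  if replicate_expert:
    return ShardingAxisName.MODEL
  return ShardingAxisName.ATTN_HEAD


# Usage: resolve the flag once at construction time and reuse the result.
replicate = gdn_replicate_expert_from_env({"MAXTEXT_GDN_REPLICATE_EXPERT": "true"})
axis = attn_head_axis(replicate)
```

Passing the environment as a mapping keeps the parsing testable without mutating `os.environ`.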

BUGS: b/517158881

Tests

NEW_MODEL_DESIGN=1 \
    PYTHONPATH=/home/mohitkhatwani_google_com/workspace/maxtext/src \
    MAXTEXT_GDN_REPLICATE_EXPERT=true \
    /home/mohitkhatwani_google_com/workspace/max_venv/bin/python3 \
      -m maxtext.inference.vllm_decode src/maxtext/configs/base.yml \
      base_output_directory=gs://runner-maxtext-logs \
      run_name=mohit-qwen3.5-maxtext-bench-$RANDOM \
      model_name=qwen3.5-35b-a3b \
      tokenizer_path=Qwen/Qwen3.5-35B-A3B \
      vllm_hf_overrides='{architectures: ["MaxTextForCausalLM"]}' \
      load_parameters_path=gs://maxtext-model-checkpoints/qwen3.5-35b-a3b/unscanned/0/items \
      ici_tensor_parallelism=4 \
      hbm_utilization_vllm=0.5 \
      prompt="Tell me three fun facts about Buenos Aires." \
      decode_sampling_temperature=0.0 decode_sampling_nucleus_p=1.0 decode_sampling_top_k=0.0 \
      pure_nnx_decoder=True use_chat_template=True max_num_seqs=1 \
      enable_dp_attention=False prefuse_moe_weights=True max_target_length=64 debug_sharding=False \
      scan_layers=false profiler=xplane

Profile: https://xprof.corp.google.com/trace_viewer/mohitkhatwani-11546393038070241787?view_start=71184.720&view_end=71204.219

Logs: https://paste.googleplex.com/5464600022745088

Decode performance: 25 ms

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Jun 3, 2026

@github-actions

github-actions Bot commented Jun 3, 2026

🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment

## 📋 Review Summary

This Pull Request introduces support for the Gated Delta Net (GDN) kernel from tpu-inference into MaxText, specifically for vLLM decoding. It includes integration of the GDN logic, sharding support, profiling capabilities, and necessary monkey-patches for hybrid Attention+GDN models.

🔍 General Feedback

  • Configurability: The use of environment variables like MAXTEXT_GDN_REPLICATE_EXPERT for model behavior should be transitioned to the formal Config system for better reproducibility.
  • Robustness: Global monkey-patching of library classes (e.g., ModelConfig.uses_mrope and KVCacheManager) is quite brittle and may lead to issues with future updates or different model types.
  • Performance: The dynamic padding logic in Qwen3NextGatedDeltaNet could trigger frequent JAX re-compilations if batch sizes vary, which should be addressed for production workloads.
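
One common way to limit the recompilation concern raised above is to pad to a small, fixed set of bucket sizes, since a jitted JAX function is retraced for each new input shape. The sketch below is plain Python and only illustrates the bucketing arithmetic; `pad_to_bucket` and `min_bucket` are hypothetical names, and this is not the padding logic in `Qwen3NextGatedDeltaNet`.

```python
def pad_to_bucket(batch_size: int, min_bucket: int = 8) -> int:
  """Rounds batch_size up to the next power-of-two multiple of min_bucket."""
  if batch_size < 1:
    raise ValueError("batch_size must be positive")
  bucket = min_bucket
  while bucket < batch_size:
    bucket *= 2
  return bucket


# Eight distinct batch sizes collapse into three padded shapes, so a jitted
# function would see three input shapes instead of eight.
observed = [1, 3, 7, 9, 15, 16, 17, 30]
buckets = sorted({pad_to_bucket(b) for b in observed})
# buckets == [8, 16, 32]
```

The trade-off is wasted compute on padded rows in exchange for a bounded number of compiled variants.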

@github-actions

github-actions Bot commented Jun 3, 2026

🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions Bot left a comment

## 📋 Review Summary

This Pull Request introduces comprehensive support for Gated Delta Net (GDN) kernels from tpu-inference into MaxText, including integration with vLLM and necessary sharding updates. The changes are logically sound and significantly improve the flexibility of the Qwen3/Qwen3.5 model implementations.

🔍 General Feedback

  • Consistency: The simplification of kv_cache handling in decoders.py is a great improvement, bringing Qwen3 into alignment with other models in the codebase.
  • Complexity: The monkey-patching and adapter logic are necessary evils for this level of integration, but should be monitored closely for breakages when upstream vLLM or tpu-inference APIs evolve.
  • Performance: The use of shard_map and specialized kernels in qwen3.py demonstrates a high level of optimization for TPU sharding.
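
To make the breakage risk noted above visible early, a patch can check that its target still exists before replacing it and restore the original afterwards. The following is a sketch of that scoped alternative to global monkey-patching, not what this PR does; `patched_attr` and `FakeModelConfig` are hypothetical names used only for illustration.

```python
import contextlib

_MISSING = object()


@contextlib.contextmanager
def patched_attr(owner, name, replacement):
  """Temporarily replaces owner.<name>, failing loudly if the target is gone."""
  if not hasattr(owner, name):
    raise AttributeError(
        f"{owner!r} has no attribute {name!r}; the upstream API may have changed"
    )
  # Remember only what is defined directly on owner, so inherited attributes
  # are restored by deleting the override rather than copying them down.
  original = vars(owner).get(name, _MISSING)
  setattr(owner, name, replacement)
  try:
    yield
  finally:
    if original is _MISSING:
      delattr(owner, name)
    else:
      setattr(owner, name, original)


class FakeModelConfig:
  """Hypothetical stand-in for a library config class with a property."""

  @property
  def uses_mrope(self) -> bool:
    return True


# Usage: the override is active only inside the with-block.
with patched_attr(FakeModelConfig, "uses_mrope", property(lambda self: False)):
  inside = FakeModelConfig().uses_mrope
outside = FakeModelConfig().uses_mrope
```

If an upstream release renames or removes the attribute, the `AttributeError` surfaces at patch time instead of as a silent behavior change later.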

@github-actions

github-actions Bot commented Jun 3, 2026

🤖 I'm sorry @khatwanimohit, but I was unable to process your request. Please see the logs for more details.
