Snehalv dsv4 conv utility #4078

Draft
snehalv2002 wants to merge 22 commits into main from snehalv-dsv4-conv-utility

snehalv2002 commented Jun 5, 2026

Description

Adds standalone and conversion utility support for DeepSeek V4 (scanned).
FIXES: b/509930555

Tests

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

parambole and others added 22 commits May 14, 2026 17:45
…near)

Implement architectural core primitives required for DeepSeek-V4 integration into MaxText:

- DeepSeekV4RMSNorm & DeepSeekV4UnweightedRMSNorm: RMS normalization layers utilizing float32 variance pooling. Includes unweighted scale-free variants that avoid allocating or synchronizing trainable weight parameters.
- DeepSeekGroupedLinear: Block-diagonal grouped linear projection layer supporting parallel group projection via einsum broadcasting ([B, S, hc_mult, D] -> [B, S, D]).
- DeepSeekV4RotaryEmbedding: Interleaved partial rotary positional embedding pairing consecutive even/odd channels.
- Unit test suite (deepseek_v4_vs_reference_test.py) validating numerical parity against PyTorch reference implementations at atol=1e-5, rtol=1e-5.
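
For reviewers unfamiliar with these primitives, here is a minimal NumPy sketch of two of them: an unweighted RMS norm that computes the mean square in float32, and an interleaved partial rotary embedding that pairs consecutive even/odd channels. This is not the PR's MaxText/JAX code. The function names, the `eps` and `base` defaults, and the choice to rotate the first `rot_dim` channels (passing the rest through) are assumptions made for illustration.

```python
import numpy as np


def unweighted_rms_norm(x, eps=1e-6):
    """Scale-free RMS norm: no trainable weight, statistics in float32.

    Hypothetical helper; eps=1e-6 is an assumed default.
    """
    x32 = x.astype(np.float32)
    mean_sq = np.mean(np.square(x32), axis=-1, keepdims=True)
    # Cast back so the output dtype matches the input dtype.
    return (x32 / np.sqrt(mean_sq + eps)).astype(x.dtype)


def interleaved_partial_rope(x, positions, rot_dim, base=10000.0):
    """Rotate channel pairs (0,1), (2,3), ... of the first rot_dim channels.

    x: [..., S, D], positions: [S]. Channels rot_dim..D-1 pass through.
    Which slice is rotated is an assumption, not taken from the PR.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2, dtype=np.float32) / rot_dim))
    angles = positions[:, None].astype(np.float32) * inv_freq[None, :]  # [S, rot_dim/2]
    cos, sin = np.cos(angles), np.sin(angles)

    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    even, odd = x_rot[..., 0::2], x_rot[..., 1::2]
    rot_even = even * cos - odd * sin
    rot_odd = even * sin + odd * cos
    # Re-interleave as even0, odd0, even1, odd1, ...
    rotated = np.stack([rot_even, rot_odd], axis=-1).reshape(x_rot.shape)
    return np.concatenate([rotated, x_pass], axis=-1)
```

A parity check in the spirit of the test suite would compare such outputs against a reference with `np.testing.assert_allclose(actual, expected, atol=1e-5, rtol=1e-5)`.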
…outer, RoutedMoE)

Implement Mixture of Experts routing gates and execution layers for DeepSeek-V4 integration into MaxText:

- HashRouter: Token routing mechanism utilizing MD5 hash projections for deterministic expert assignment.
- TopKRouter: Gated top-k router implementing sigmoid scaling and score normalization.
- RoutedMoE & RoutedAndSharedMoE: Execution layers supporting layer_idx routing and FP32 expert summation parity.
- Parity verification: Extended unit test suite (deepseek_v4_vs_reference_test.py) validating routing parity against PyTorch reference implementations at atol=1e-5, rtol=1e-5.
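
As a rough illustration of the two routing styles named above, the sketch below shows one way a hash-based router and a sigmoid top-k router could work. It is not the PR's implementation: how the actual HashRouter derives its MD5 inputs is not stated here, so hashing `layer_idx` together with the token id is an assumption, and both function names are hypothetical.

```python
import hashlib

import numpy as np


def hash_route(token_ids, num_experts, layer_idx=0):
    """Deterministically map each token id to one expert via MD5.

    Assumption: the hash key is "<layer_idx>:<token_id>"; the PR may differ.
    """
    experts = []
    for token_id in token_ids:
        digest = hashlib.md5(f"{layer_idx}:{int(token_id)}".encode()).digest()
        experts.append(int.from_bytes(digest[:8], "big") % num_experts)
    return np.array(experts, dtype=np.int32)


def sigmoid_topk_route(logits, k):
    """Pick the k highest sigmoid scores per token and renormalize them.

    logits: [tokens, num_experts]. Returns (indices, weights), both [tokens, k].
    """
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float32)))
    top_idx = np.argsort(-scores, axis=-1)[..., :k]
    top_scores = np.take_along_axis(scores, top_idx, axis=-1)
    # Normalize so the selected experts' weights sum to 1 for each token.
    weights = top_scores / np.sum(top_scores, axis=-1, keepdims=True)
    return top_idx, weights
```

The hash route needs no learned parameters and gives the same assignment for the same token on every run, while the top-k route depends on learned gate logits.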
…ghtningIndexer)

Implement compressed attention mechanisms and indexer modules for DeepSeek-V4 integration into MaxText:

- CSACompressor & HCACompressor: Long-range attention compressors supporting causal block bias and YaRN frequency scaling decoupling.
- LightningIndexer: Memory-efficient indexer module implementing sentinel masking and dynamic RoPE scaling.
- Configuration: Register attention compression hyperparameters (compress_ratios, index_head_dim, sliding_window) in types.py and base.yml.
- Parity verification: Extended unit test suite (deepseek_v4_vs_reference_test.py) validating attention compression parity against PyTorch reference implementations at atol=1e-5, rtol=1e-5.
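
To make "causal block bias" concrete, here is a small NumPy sketch under one plausible reading: keys are compressed into blocks of `compress_ratio` tokens, and a query may attend to a compressed block only once every token in that block is at or before the query position. The exact visibility rule used by CSACompressor and HCACompressor is not described in this PR text, so the rule, the function name, and the `neg` fill value are all assumptions.

```python
import numpy as np


def causal_block_bias(seq_len, compress_ratio, neg=-1e9):
    """Additive attention bias of shape [seq_len, num_blocks].

    Entry (q, j) is 0 if compressed block j ends at or before position q,
    else `neg`. Trailing tokens that do not fill a whole block are ignored.
    """
    num_blocks = seq_len // compress_ratio
    q_pos = np.arange(seq_len)[:, None]
    # Last token position covered by each compressed block.
    block_end = (np.arange(num_blocks)[None, :] + 1) * compress_ratio - 1
    visible = block_end <= q_pos
    return np.where(visible, 0.0, neg).astype(np.float32)
```

With `seq_len=8` and `compress_ratio=4`, positions 0 to 2 see no compressed block, positions 3 to 6 see only the first, and position 7 sees both.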
2 participants