Add distillation reference guide and online-distillation recipe #3729
🤖 Hi @gagika, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
This Pull Request introduces comprehensive documentation for online distillation in MaxText, including a new reference guide and an expanded tutorial. It also adds Gemma 4 model details and configuration updates for MoE load balancing.
🔍 General Feedback
- The new distillation guide is very well-structured, covering architecture, loss anatomy, and practical tuning advice.
- Most documentation updates are accurate and include helpful examples for both single-host and multi-host (XPK) setups.
- There is a recurring pattern of backslash escaping for MyST directives in docs/guides.md that appears to be a regression and should be corrected to ensure proper rendering.
::::\{grid} 1 2 2 2
:gutter: 2

- :::{grid-item-card} ⚡ Optimization
+ :::\{grid-item-card} ⚡ Optimization
🟠 The backslash escaping \{grid} and \{grid-item-card} is likely a mistake. In MyST markdown, directives are typically formatted as :::{directive}. Adding a backslash before the curly brace will likely cause the documentation builder to render the brace and directive name as literal text instead of parsing it as a directive. This appears to have been applied to all cards in this file.
::::\{grid} 1 2 2 2
:gutter: 2
:::{grid-item-card} ⚡ Optimization
:::\{grid-item-card} ⚡ Optimization

::::{grid} 1 2 2 2
:gutter: 2
:::{grid-item-card} ⚡ Optimization
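The escaping problem flagged in the review comment can be checked mechanically before the docs are built. The following is a minimal sketch, not part of MaxText or this PR: the helper names `find_escaped_directives` and `unescape_directives` and the regular expression are assumptions made for illustration, and the pattern only covers colon-fenced directives of the form shown in the diff.

```python
import re

# Matches a colon fence (three or more colons) followed by an escaped
# opening brace and a directive name, e.g. "::::\{grid}".
ESCAPED_DIRECTIVE = re.compile(r"^(\s*:{3,})\\\{([\w-]+)\}")


def find_escaped_directives(text):
    """Return (line_number, line) pairs whose directive brace is escaped."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if ESCAPED_DIRECTIVE.match(line)
    ]


def unescape_directives(text):
    """Drop the backslash so ':::\\{name}' becomes ':::{name}'."""
    return "\n".join(
        ESCAPED_DIRECTIVE.sub(r"\1{\2}", line) for line in text.splitlines()
    )


# Sample taken from the lines quoted in the review comment.
sample = (
    "::::\\{grid} 1 2 2 2\n"
    ":gutter: 2\n"
    "\n"
    ":::\\{grid-item-card} ⚡ Optimization\n"
)

for number, line in find_escaped_directives(sample):
    print(f"line {number}: {line}")

print(unescape_directives(sample))
```

Running a check like this over docs/guides.md would list every card affected by the escaping, which helps confirm whether the pattern really applies to all cards in the file.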
Codecov Report
✅ All modified and coverable lines are covered by tests.
Description
Adds a new reference guide (docs/guides/distillation.md) covering the online distillation trainer — loss anatomy, α/β/temperature schedules, layer-index selection, monitoring, and troubleshooting — and extends the knowledge-distillation tutorial with an Online Distillation section (Pattern A compression, Pattern B depth-pruning recovery, XPK multi-host, and the offline top-k logits variant). The new guide is linked from docs/guides.md.
Tests
Ran most of the commands.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.