
[Distillation] base learn-to-init llama attention for distillation#3688

Open
vlad-karp wants to merge 17 commits into main from vladk/lti

Conversation

@vlad-karp
Collaborator

@vlad-karp vlad-karp commented Apr 17, 2026

Description

This PR introduces the base implementation of Learn-to-Init (LTI) attention for distillation (LLaMA only for now, but it can easily be generalized).

Relevant details and context:

  • Problem being solved: Setting up the foundational LTI components required for effective attention distillation of LLaMA models.
  • Implementation details:
    • Adds the core LTI logic in a new learn_to_init_layer.py module.
    • Updates the distillation pipeline (distillation_utils.py and train_distill.py) and decoders to support the new attention layer.
    • Two LTI implementations for GQA: a bi-linear map and a global linear map option.
    • Configures the system so the LTI student obtains the teacher shapes it needs at init time directly from the configuration.
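
For readers unfamiliar with the approach, here is a minimal, self-contained sketch of the "global linear map" idea for GQA. All names and shapes are illustrative assumptions, not the PR's actual API: a single learnable matrix over the head axis maps the teacher's KV-projection weights into the student's grouped-query shape, initialized to an even averaging of teacher heads.

```python
# Illustrative sketch only; the real learn_to_init_layer.py may differ.
import numpy as np

def global_linear_init(teacher_kv, num_student_kv_heads):
    """Map teacher KV weights [embed, n_teacher_heads, head_dim] into a
    student GQA shape [embed, n_student_kv_heads, head_dim] with a single
    learnable map over the head axis (the "global linear map" variant)."""
    embed, n_teacher, head_dim = teacher_kv.shape
    # Learnable map, initialized to an even grouping of teacher heads.
    m = np.zeros((num_student_kv_heads, n_teacher))
    group = n_teacher // num_student_kv_heads
    for s in range(num_student_kv_heads):
        m[s, s * group:(s + 1) * group] = 1.0 / group
    student_kv = np.einsum("eth,st->esh", teacher_kv, m)
    return m, student_kv
```

The bi-linear variant would instead learn separate maps on two axes (e.g. head and head_dim); the global variant above shares one map across all layers' KV heads.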

Tests

  • Added a new unit test for LearnToInitDense as well as for the teacher-to-student weight injection logic.
    src/maxtext/tests/post_training/unit/learn-to-init_test.py

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov

codecov Bot commented Apr 17, 2026

@vlad-karp vlad-karp changed the title base learn-to-init llama attention for distillation [Distillation] base learn-to-init llama attention for distillation Apr 20, 2026
Comment thread src/maxtext/layers/learn_to_init_layer.py
self,
raw_iterator: Any | None,
root_directory: str | None = None,
student_config: Any | None = None,
Collaborator


nit: can it be None?
Perhaps change the default and data type if it can't.

It effectively collapses the learn-to-init parameterization back into a standard
decoder architecture, modifying the `student_model` in-place.

NOTE: works for ToNXX decoder model and layer-scan mode only
Collaborator


could you throw an exception if it's not layer-scan mode?
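
A sketch of the suggested guard (the config attribute name `scan_layers` is an assumption about the MaxText config; the actual flag may differ):

```python
def collapse_learn_to_init(student_model, config):
    """Collapse the LTI parameterization back into a standard decoder.

    Only layer-scan mode is supported, so fail fast otherwise,
    as suggested in the review.
    """
    if not getattr(config, "scan_layers", False):
        raise ValueError(
            "learn-to-init collapse only supports layer-scan mode; "
            "set scan_layers=True in the student config."
        )
    # ... in-place collapse of student_model would follow here ...
```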

LLAMA4 = "llama4"
OLMO3 = "olmo3"

LLAMA2LTI = "llama2-learn-to-init"
Collaborator


For naming convention, could you use "_" instead of "-"? e.g. LLAMA2_LTI = "llama2_lti"

Collaborator


Could you name the test with "_" instead of "-"? e.g. learn_to_init_test.py

self._buffered_train_metrics.additional_metrics[name] = ([], distillation_utils.weighted_mean)

self._buffered_train_metrics.additional_metrics[name][0].append(value)
max_logging.log(f"Distillation metrics: {aux}")
Collaborator


Is it logged at every step or just once? It's inside _post_process_train_step.
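
As an aside, the buffering pattern quoted above pairs an accumulator list with a reduction function. A sketch of how such a pair could be reduced at log time follows; the `(value, weight)` shape is a guess, since the actual `distillation_utils.weighted_mean` implementation isn't shown in this thread:

```python
def weighted_mean(pairs):
    """Reduce buffered (value, weight) pairs to a weighted average.

    Illustrative only: the real distillation_utils.weighted_mean
    may accept a different shape (e.g. plain scalars).
    """
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

# Mimic the buffering pattern from the snippet above.
additional_metrics = {}
name = "distill/kl_loss"  # hypothetical metric name
if name not in additional_metrics:
    additional_metrics[name] = ([], weighted_mean)
additional_metrics[name][0].append((0.5, 2.0))
additional_metrics[name][0].append((1.0, 2.0))

buffer, reducer = additional_metrics[name]
result = reducer(buffer)  # (0.5*2 + 1.0*2) / 4 = 0.75
```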



4 participants