Commit 0f3877a
Init lora_B with small random values and restore logits gradient path
B=0 initialization creates a saddle point where the preservation gradient
is exactly zero at init, allowing the EAGLE logits gradient to dominate
unopposed before preservation can react. Initialize lora_B with N(0, 0.01)
so the preservation loss is active from step 0 and constrains LoRA from
the start.
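
The zero-gradient claim above can be checked by hand on a toy model. This is a minimal sketch, not the code touched by this commit: it assumes a rank-1 LoRA update with scalar factors a and b, assumes the preservation loss is the squared difference between adapted and base outputs, and treats 0.01 as a standard deviation (the message does not say whether it is std or variance). The function name preservation_grads is hypothetical.

```python
import random

def preservation_grads(a, b, x):
    """Hand-derived gradients of L = (b * a * x)^2 with respect to a and b."""
    delta = b * a * x              # LoRA contribution to the output
    grad_a = 2.0 * delta * b * x   # dL/da
    grad_b = 2.0 * delta * a * x   # dL/db
    return grad_a, grad_b

random.seed(0)
a, x = 0.5, 2.0

# b = 0: delta is zero, so both gradients vanish.
zero_init = preservation_grads(a, 0.0, x)

# b ~ N(0, 0.01), 0.01 assumed to be the std: both gradients are nonzero.
b_small = random.gauss(0.0, 0.01)
small_init = preservation_grads(a, b_small, x)

print("B=0 init:   ", zero_init)
print("small init: ", small_init)
```

Both gradients carry a factor of delta, so any loss of this form that is minimized at delta = 0 gives the preservation term nothing to push back with until b moves away from zero.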
With preservation active at init, restore the direct logits gradient path
(remove detach on base_outputs.logits in EAGLE loss) to give LoRA a strong
training signal while relying on preservation loss to prevent collapse.
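
What removing the detach changes can be sketched on the same kind of scalar toy model. This is an illustration under assumptions, not the repository's EAGLE loss: the loss is taken to be the squared difference between a draft logit and the base logit, and eagle_grad_b with its detach_base flag is a hypothetical helper that mimics detaching base_outputs.logits by treating the base logit as a constant.

```python
def eagle_grad_b(a, b, w, x, draft_logit, detach_base):
    """d/db of (draft_logit - base_logit)^2 for a scalar toy model."""
    base_logit = (w + b * a) * x   # base weight plus rank-1 LoRA update
    if detach_base:
        # Detached: base_logit acts as a constant target, so no gradient
        # reaches the LoRA factor through this path.
        return 0.0
    # Not detached: chain rule through base_logit.
    return -2.0 * (draft_logit - base_logit) * a * x

args = dict(a=0.5, b=0.01, w=1.0, x=2.0, draft_logit=3.0)
detached = eagle_grad_b(detach_base=True, **args)
restored = eagle_grad_b(detach_base=False, **args)

print("with detach:   ", detached)
print("without detach:", restored)
```

In this toy setting the detached path contributes no gradient to b, while the restored path does, which is the extra training signal the message refers to.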
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
1 parent ec61f24 commit 0f3877a
1 file changed
Lines changed: 7 additions & 5 deletions
@@ -560,10 +560,15 @@ (1 deletion, 6 additions; line contents not captured)
@@ -1017,10 +1022,7 @@ (4 deletions, 1 addition; line contents not captured)