Fix LayerNorm Scaling implementation #698

Open
davidgonmar wants to merge 1 commit into allenai:main from davidgonmar:fix-layernormscaling

@davidgonmar

Issue

The current LayerNorm Scaling implementation has a silent bug when used in the training pipeline.

The scale is implemented as a buffer and initialized in the __init__ method of LayerNormScaledTransformerBlock. However, model initialization in src/olmo_core/nn/transformer/model.py calls to_empty before initializing each module's weights individually. Because to_empty allocates fresh, uninitialized storage for every parameter and buffer, the value written to ln_scale in __init__ is discarded, and nothing restores it afterwards. Training then silently runs with an arbitrary scale.
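The failure mode can be reproduced outside the training pipeline with a minimal sketch. BufferScaledBlock is a hypothetical stand-in, not the actual LayerNormScaledTransformerBlock, and constructing it on the meta device is an assumption made only to keep the example deterministic.

```python
import torch
import torch.nn as nn


class BufferScaledBlock(nn.Module):
    """Hypothetical stand-in for a block that stores its scale in a buffer."""

    def __init__(self, scale: float, init_device: str = "meta"):
        super().__init__()
        self.norm = nn.LayerNorm(4, device=init_device)
        # The scale value is written exactly once, at construction time.
        self.register_buffer("ln_scale", torch.full((), scale, device=init_device))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.ln_scale


block = BufferScaledBlock(scale=0.5)
on_meta = block.ln_scale.is_meta  # no real storage yet

# to_empty allocates fresh, uninitialized storage for parameters AND buffers.
block.to_empty(device="cpu")

# Parameters get their values back from their own init routine...
block.norm.reset_parameters()

# ...but nothing rewrites ln_scale, so it now holds arbitrary memory, not 0.5.
buffer_device = block.ln_scale.device.type
```

No error is raised at any point, which is why the bug is silent: the buffer exists on the right device with the right shape, only its value is wrong.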

Fix

There are two possible fixes. The first is to keep using a buffer and implement a custom initialization function. The second is to implement the scale as a Python scalar, which is simpler and is the approach taken in this PR.
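As a rough sketch of the second option, the scale can live as a plain Python float on the module. ScalarScaledBlock is a hypothetical toy class, not the code in this PR, and the meta-device construction is again an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScalarScaledBlock(nn.Module):
    """Hypothetical toy block that stores its scale as a Python float."""

    def __init__(self, scale: float, init_device: str = "meta"):
        super().__init__()
        self.norm = nn.LayerNorm(4, device=init_device)
        # Plain attribute: not a parameter or buffer, so to_empty never touches it.
        self.ln_scale = float(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.ln_scale


block = ScalarScaledBlock(scale=0.5)
block.to_empty(device="cpu")   # re-allocates only parameters and buffers
block.norm.reset_parameters()  # weight = 1, bias = 0

x = torch.randn(2, 4)
out = block(x)
expected = F.layer_norm(x, (4,)) * 0.5
```

The trade-off is that a Python scalar is not part of the state_dict and does not move with the module across devices. That is acceptable when the scale is a constant derived from the model configuration.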
