
Commit 0900104

navsud authored and facebook-github-bot committed
Make ScalelessRMSNorm a torch.nn.RMSNorm with frozen ones weight
Summary:
ScalelessRMSNorm was previously implemented as a hand-rolled RMS normalization (decomposed into mean / rsqrt / mul). On the QNN export path, this decomposition fails to lower. Using torch.nn.RMSNorm directly works.

Re-implemented ScalelessRMSNorm as a torch.nn.RMSNorm subclass whose weight is hardcoded to ones and frozen (requires_grad=False). This keeps the public interface (ScalelessRMSNorm(dim, eps)) unchanged while letting backends see a proper RMSNorm op, so it lowers to QNN correctly.

Reviewed By: billmguo

Differential Revision: D104258950
1 parent 1643611 commit 0900104

1 file changed

examples/models/llama/norm.py

Lines changed: 12 additions & 9 deletions
@@ -41,17 +41,20 @@ def forward(self, x):
         return output * self.weight.type_as(x)


-class ScalelessRMSNorm(torch.nn.Module):
+class ScalelessRMSNorm(torch.nn.RMSNorm):
+    """RMSNorm with weight hardcoded to ones and not trainable.
+
+    Equivalent to a scaleless RMSNorm (no learnable scaling) but implemented as a
+    torch.nn.RMSNorm so the op composes/decomposes cleanly for backends like QNN
+    instead of being expressed as a hand-rolled decomposition.
+    """
+
     def __init__(self, dim: int, eps: float = 1e-6):
-        super().__init__()
+        super().__init__(dim, eps)
         self.dim = dim
-        self.eps = eps
-
-    def forward(self, x):
-        x_float = x.float()
-        return (
-            x_float * torch.rsqrt((x_float * x_float).mean(-1, keepdim=True) + self.eps)
-        ).type_as(x)
+        with torch.no_grad():
+            self.weight.fill_(1.0)
+        self.weight.requires_grad = False


 class RMSNormWithInputScale(torch.nn.Module):
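
For reference, a minimal sketch (not part of the commit, and assuming a PyTorch version that ships torch.nn.RMSNorm, i.e. 2.4+) that exercises the new class and checks it against the hand-rolled decomposition the old forward() computed; the input shape is illustrative.

import torch


class ScalelessRMSNorm(torch.nn.RMSNorm):
    """RMSNorm with weight hardcoded to ones and not trainable."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__(dim, eps)
        self.dim = dim
        # Freeze the scale at 1.0 so the module stays scaleless.
        with torch.no_grad():
            self.weight.fill_(1.0)
        self.weight.requires_grad = False


x = torch.randn(2, 8, 64)  # (batch, seq_len, dim) -- illustrative shape
norm = ScalelessRMSNorm(64)

# Hand-rolled decomposition used by the previous implementation.
expected = x * torch.rsqrt((x * x).mean(-1, keepdim=True) + 1e-6)

assert torch.allclose(norm(x), expected, atol=1e-6)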
