
Standardize weight_decay_mask naming and add weight decay #1635

Open

rdyro wants to merge 2 commits into google-deepmind:main from rdyro:alias-weight-decay-unify

Conversation

@rdyro (Collaborator) commented Mar 20, 2026

Test PR co-written with Gemini.

Addresses inconsistencies where the mask parameter for weight decay was named `mask` instead of `weight_decay_mask` in optimizers like adamw, nadamw, adan, lion, lamb, and adamaxw. Additionally, missing decoupled weight decay support (with `weight_decay` and `weight_decay_mask`) was added to multiple base optimizers including adabelief, adagrad, amsgrad, radam, rmsprop, sgd, sm3, yogi, optimistic_gradient_descent, optimistic_adam, and optimistic_adam_v2. The novograd alias continues to use its internal decoupled approach.
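For reference, a minimal sketch of a call under the standardized naming. The keyword change (`mask` → `weight_decay_mask`) is what this PR proposes; the mask function below is a common convention (decay only parameters with more than one dimension) and is illustrative, not part of the diff:

```python
import jax
import optax

def decay_mask_fn(params):
    # Common convention: decay weight matrices only; skip 1-D
    # parameters such as biases and norm scales.
    return jax.tree_util.tree_map(lambda p: p.ndim > 1, params)

tx = optax.adamw(
    learning_rate=1e-3,
    weight_decay=1e-4,
    weight_decay_mask=decay_mask_fn,  # previously `mask`
)
```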

Optimizers With Weight Decay

Designed with explicit decoupled weight decay from the start:

  • adamw
  • nadamw
  • adamaxw

Already have weight decay added:

  • adan
  • lion
  • lamb
  • adadelta
  • adafactor (named weight_decay_rate; see the sketch after this list)
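
adafactor's divergent keyword, for comparison; a sketch assuming the current optax signature:

```python
import optax

# adafactor names its decay rate `weight_decay_rate` (with a matching
# `weight_decay_mask`), one of the naming inconsistencies this PR catalogues.
tx = optax.adafactor(learning_rate=1e-3, weight_decay_rate=1e-4)
```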

Newly updated to standard decoupled weight decay (a composition sketch follows the list):

  • adabelief
  • adagrad
  • amsgrad
  • radam
  • rmsprop
  • sgd
  • sm3
  • yogi
  • optimistic_gradient_descent
  • optimistic_adam
  • optimistic_adam_v2
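
Decoupled here means the decay term is added after the optimizer's own scaling but before the learning-rate step, as in adamw. A minimal sketch of what the added support amounts to, expressed with existing optax transforms and using an SGD-with-momentum configuration as the example (the exact composition in the diff may differ):

```python
import optax

wd = 1e-4
learning_rate = 0.1

tx = optax.chain(
    optax.trace(decay=0.9),                       # SGD momentum
    optax.add_decayed_weights(wd, mask=None),     # decoupled weight decay
    optax.scale_by_learning_rate(learning_rate),  # final -lr * update
)
```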

Apply weight decay inside their own transformation:

  • novograd
  • lars

Optimizers Without Weight Decay

Left without weight decay so they remain the unmodified base algorithms; kept that way for backward compatibility with their "W" counterparts.

  • adam
  • nadam
  • adamax

Modify gradient signs (norm-agnostic):
Weight decay could behave unpredictably since the gradient magnitude is ignored (see the sketch after this list).

  • sign_sgd
  • signum
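
To make the concern concrete, the update rules written out (illustrative, not from the diff): sign-based methods move every coordinate by exactly the learning rate, so a decoupled decay term would reintroduce a magnitude-dependent component into an otherwise magnitude-free update.

```latex
% sign_sgd: every coordinate moves by exactly \eta, regardless of |g_t|.
\theta_{t+1} = \theta_t - \eta \,\operatorname{sign}(g_t)

% With decoupled decay, the step would mix a magnitude-dependent term
% \lambda \theta_t into the otherwise fixed-magnitude update:
\theta_{t+1} = \theta_t - \eta \left( \operatorname{sign}(g_t) + \lambda \theta_t \right)
```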

Niche algorithms or raw-gradient wrappers:
Weight decay is not a standard practice for these methods.

  • noisy_sgd
  • rprop
  • polyak_sgd
  • lbfgs
  • fromage

@rdyro force-pushed the alias-weight-decay-unify branch from 670f1d3 to 6ed973a on March 20, 2026 05:09
@rdyro force-pushed the alias-weight-decay-unify branch from 988573e to 4cf37de on March 20, 2026 06:06
