Lion: EvoLved Sign Momentum
Lion only keeps track of the gradient moving average (momentum), which can reduce memory usage compared to AdamW. Lion uses two EMA factors: one for tracking the momentum and another for interpolating the momentum with the current gradient in the update step. With the default hyperparameters, this allows momentum tracking over up to ten times longer a gradient history while weighting the update step more toward the current gradient. Unlike most optimizers, Lion applies an update of the same magnitude to every parameter, calculated using the sign operation.
Lion was introduced by Chen et al. in *Symbolic Discovery of Optimization Algorithms*.
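As a quick orientation before the hyperparameter notes, a minimal usage sketch with optimi's Lion (assuming the package is installed as `torch-optimi` and imported as `optimi`; the model, data, and hyperparameter values are placeholders, not recommendations):

```python
import torch
from torch import nn
from optimi import Lion

model = nn.Linear(10, 1)

# Lion learning rates are typically 3-10x smaller than AdamW's, with
# weight decay increased by the same factor (see Hyperparameters below).
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.1)

x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```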
Hyperparameters
Hyperparameter notes from Chen et al:
- Due to the larger update norm from the sign operation, a good Lion learning rate is typically 3-10X smaller than AdamW's.
- Since the effective weight decay is multiplied by the learning rate[^1], weight decay should be increased in proportion to the learning rate decrease (3-10X), as shown in the example after this list.
- Except for language modeling, \(\beta\)s are set to `(0.9, 0.99)`. When training T5, Chen et al set \(\beta_1=0.95\) and \(\beta_2=0.98\). Reducing \(\beta_2\) results in better training stability due to less historical memorization.
- The optimal batch size for Lion is 4096 (vs AdamW's 256), but Lion still performs well at a batch size of 64 and matches or exceeds AdamW on all tested batch sizes.
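As a concrete illustration of the first two notes, a hypothetical conversion from AdamW hyperparameters to Lion (the AdamW values are placeholders, not recommendations):

```python
# Hypothetical starting point: AdamW hyperparameters that worked for a model.
adamw_lr, adamw_wd = 1e-3, 1e-2

# Lion rule of thumb from Chen et al: divide the learning rate by 3-10x and
# multiply weight decay by the same factor so the effective weight decay
# (lr * wd) stays roughly constant.
scale = 10
lion_lr = adamw_lr / scale   # 1e-4
lion_wd = adamw_wd * scale   # 1e-1

assert abs(adamw_lr * adamw_wd - lion_lr * lion_wd) < 1e-12
```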
optimi's implementation of Lion also supports fully decoupled weight decay by setting `decouple_lr=True`. If using fully decoupled weight decay, do not increase the weight decay. Rather, weight decay will likely need to be reduced, as the learning rate will not modify the effective weight decay.
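A sketch of the fully decoupled variant, reusing the hypothetical Lion hyperparameters from the example above; since the learning rate no longer scales the weight decay, the coefficient is set near the previous effective value (`lr * weight_decay`) rather than increased:

```python
import torch.nn as nn
from optimi import Lion

model = nn.Linear(10, 1)  # placeholder model

# With fully decoupled weight decay the learning rate does not modify the
# effective weight decay, so use a smaller coefficient than the decoupled
# case, e.g. roughly the old effective value (1e-4 * 0.1 = 1e-5).
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-5, decouple_lr=True)
```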
Lion
Lion optimizer. Evolved Sign Momentum.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for update moving average and gradient moving average (default: (0.9, 0.99)) | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If `decouple_lr=False`, applies decoupled weight decay (default: 0) | `0` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of decoupled weight decay (default: False) | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate (default: None) | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use foreach over the for-loop implementation since it is significantly faster (default: None) | `None` |
| `gradient_release` | `bool` | Fuses the optimizer step and `zero_grad` as part of the parameter's backward pass. Requires registering optimi's gradient release model hooks (default: False) | `False` |
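A minimal sketch of low precision training with Kahan summation, assuming a bfloat16 model; `kahan_sum` is left at its default of None, so it is applied automatically for the low precision parameters:

```python
import torch
from torch import nn
from optimi import Lion

# Pure bfloat16 training: parameters and gradients in bfloat16.
model = nn.Linear(10, 1, dtype=torch.bfloat16)

# kahan_sum defaults to None, so Kahan summation is enabled automatically
# for the bfloat16 parameters; pass kahan_sum=False to disable it.
opt = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)

x = torch.randn(64, 10, dtype=torch.bfloat16)
y = torch.randn(64, 1, dtype=torch.bfloat16)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```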
Algorithm
Lion: Evolved Sign Momentum.
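A sketch of the update, following the pseudocode in Chen et al with decoupled weight decay (notation: parameters \(\theta\), gradient \(g_t\), momentum \(m\), learning rate \(\eta_t\), weight decay \(\lambda\)):

$$
\begin{aligned}
c_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t && \text{interpolate momentum and gradient}\\
\theta_t &= \theta_{t-1} - \eta_t \left( \operatorname{sign}(c_t) + \lambda\, \theta_{t-1} \right) && \text{sign update with decoupled weight decay}\\
m_t &= \beta_2 m_{t-1} + (1 - \beta_2)\, g_t && \text{update momentum}
\end{aligned}
$$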
optimi’s Lion also supports fully decoupled weight decay, which is not shown.
[^1]: The learning rate does not modify the effective weight decay when using fully decoupled weight decay.