
Lion: EvoLved Sign Momentum

Lion only keeps track of the gradient moving average (momentum), which can reduce memory usage compared to AdamW. Lion uses two momentum EMA factors: one for tracking the momentum history and another for weighting momentum against the current gradient in the update step. With the default hyperparameters, this allows up to ten times longer history for momentum tracking while leveraging more of the current gradient for the model update. Unlike most optimizers, Lion applies the same update magnitude to each parameter, calculated using the sign operation.

Lion was introduced by Chen et al. in *Symbolic Discovery of Optimization Algorithms*.
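
As a rough sketch of how the two \(\beta\)s are used (placeholder names, not optimi's internals; see the Algorithm section below for the precise definition):

```python
import torch

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Single-tensor Lion step sketch: sign of a fast gradient/momentum blend."""
    # update step: interpolate momentum and current gradient with beta1, take the sign
    update = (beta1 * m + (1 - beta1) * grad).sign()
    # momentum tracking: the slower EMA with beta2 keeps a ~10x longer gradient history
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    # decoupled weight decay is scaled by the learning rate
    param.add_(update + weight_decay * param, alpha=-lr)
```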

Hyperparameters

Hyperparameter notes from Chen et al.:

  1. Due to the larger update norm from the sign operation, a good Lion learning rate is typically 3-10X smaller than AdamW's.
  2. Since the effective weight decay is multiplied by the learning rate[^1], weight decay should be increased by the same 3-10X factor the learning rate was decreased by (see the sketch after this list).
  3. Except for language modeling, \(\beta\)s are set to (0.9, 0.99). When training T5, Chen et al set \(\beta_1=0.95\) and \(\beta_2=0.98\). Reducing \(\beta_2\) results in better training stability due to memorizing less of the gradient history.
  4. The optimal batch size for Lion is 4096 (vs AdamW's 256), but Lion still performs well at a batch size of 64 and matches or exceeds AdamW across all tested batch sizes.
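
As a concrete translation of notes 1 and 2 (the AdamW baseline values here are purely illustrative):

```python
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4)  # stand-in model

# AdamW baseline (illustrative): lr=1e-3, weight_decay=1e-2
# Lion: ~10x smaller learning rate and ~10x larger weight decay,
# keeping the effective weight decay (lr * weight_decay) unchanged
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1e-1)

# language modeling betas Chen et al used for T5
opt = Lion(model.parameters(), lr=1e-4, betas=(0.95, 0.98), weight_decay=1e-1)
```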

optimi’s implementation of Lion also supports fully decoupled weight decay: `decouple_lr=True`. If using fully decoupled weight decay, do not increase the weight decay. Rather, weight decay will likely need to be reduced, as the learning rate will not modify the effective weight decay.
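
For instance (illustrative values): with decoupled weight decay, `lr=1e-4` and `weight_decay=1e-1` yield an effective per-step decay of 1e-5, so a roughly equivalent fully decoupled configuration passes that effective value directly:

```python
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4)  # stand-in model

# decoupled (default): effective weight decay is lr * weight_decay = 1e-5
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1)

# fully decoupled: the learning rate no longer scales the weight decay,
# so pass the target effective value directly
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-5, decouple_lr=True)
```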

Lion

Lion optimizer. Evolved Sign Momentum.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for the update moving average and gradient moving average | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If `decouple_lr` is `False`, applies decoupled weight decay | `0` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of decoupled weight decay | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is `True` | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (`float16` or `bfloat16`). If unspecified, automatically applies for low precision parameters | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use the foreach implementation over the for-loop implementation since it is significantly faster | `None` |
| `gradient_release` | `bool` | Fuses the optimizer step and `zero_grad` as part of the parameter's backward pass. Requires model hooks created with `register_gradient_release`. Incompatible with closures | `False` |
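
A sketch of these options in a low precision training step (the model and data are placeholders); `kahan_sum` would engage automatically for bfloat16 parameters, but it is set explicitly here for clarity:

```python
import torch
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4).to(dtype=torch.bfloat16)
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1, kahan_sum=True)

x = torch.randn(8, 16, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # toy loss
loss.backward()
opt.step()
opt.zero_grad()
```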

Algorithm

Lion: Evolved Sign Momentum.

\[
\begin{aligned}
&\rule{100mm}{0.4pt}\\
&\hspace{2mm} \textbf{Lion} \\
&\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
&\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}\\
&\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}\\[-0.5em]
&\rule{100mm}{0.4pt}\\
&\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
&\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
&\hspace{10mm} \bm{u} \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
&\hspace{10mm} \bm{m}_t \leftarrow \beta_2 \bm{m}_{t-1} + (1 - \beta_2) \bm{g}_t\\[0.5em]
&\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl(\text{sign}(\bm{u}) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
&\rule{100mm}{0.4pt}\\
\end{aligned}
\]

optimi’s Lion also supports fully decoupled weight decay, which is not shown.
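
As a sketch of that variant, assuming fully decoupled weight decay simply drops the learning rate from the decay term (per the footnote below), the parameter update would become:

\[
\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t\,\text{sign}(\bm{u}) - \lambda\bm{\theta}_{t-1}
\]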


[^1]: The learning rate does not modify the effective weight decay when using fully decoupled weight decay.