
Lion: EvoLved Sign Momentum

Lion only keeps track of the gradient moving average (momentum), which can reduce memory usage compared to AdamW. Lion uses two momentum EMA factors: one for tracking the momentum history and another for weighting momentum against the current gradient in the update step. With the default hyperparameters, this allows up to ten times longer history for momentum tracking while leveraging more of the current gradient for the model update. Unlike most optimizers, Lion applies the same update magnitude to each parameter, calculated using the sign operation.

Lion was introduced by Chen et al. in *Symbolic Discovery of Optimization Algorithms*.
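
As a rough sketch of how the two \(\beta\)s are used (placeholder names, not optimi's internals; see the Algorithm section below for the precise definition):

```python
import torch

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Single-tensor Lion step sketch: sign of a fast gradient/momentum blend."""
    # update step: interpolate momentum and current gradient with beta1, take the sign
    update = (beta1 * m + (1 - beta1) * grad).sign()
    # momentum tracking: the slower EMA with beta2 keeps a ~10x longer gradient history
    m.mul_(beta2).add_(grad, alpha=1 - beta2)
    # decoupled weight decay is scaled by the learning rate
    param.add_(update + weight_decay * param, alpha=-lr)
```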

Hyperparameters

Hyperparameter notes from Chen et al.:

  1. Due to the larger update norm from the sign operation, a good Lion learning rate is typically 3-10X smaller than AdamW's.
  2. Since the effective weight decay is multiplied by the learning rate[^1], weight decay should be increased by the same 3-10X factor the learning rate was decreased by (see the sketch after this list).
  3. Except for language modeling, \(\beta\)s are set to (0.9, 0.99). When training T5, Chen et al set \(\beta_1=0.95\) and \(\beta_2=0.98\). Reducing \(\beta_2\) results in better training stability due to memorizing less of the gradient history.
  4. The optimal batch size for Lion is 4096 (vs AdamW's 256), but Lion still performs well at a batch size of 64 and matches or exceeds AdamW across all tested batch sizes.
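
As a concrete translation of notes 1 and 2 (the AdamW baseline values here are purely illustrative):

```python
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4)  # stand-in model

# AdamW baseline (illustrative): lr=1e-3, weight_decay=1e-2
# Lion: ~10x smaller learning rate and ~10x larger weight decay,
# keeping the effective weight decay (lr * weight_decay) unchanged
opt = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1e-1)

# language modeling betas Chen et al used for T5
opt = Lion(model.parameters(), lr=1e-4, betas=(0.95, 0.98), weight_decay=1e-1)
```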

optimi’s implementation of Lion also supports fully decoupled weight decay: `decouple_lr=True`. If using fully decoupled weight decay, do not increase the weight decay. Rather, weight decay will likely need to be reduced, as the learning rate will not modify the effective weight decay.
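
For instance (illustrative values): with decoupled weight decay, `lr=1e-4` and `weight_decay=1e-1` yield an effective per-step decay of 1e-5, so a roughly equivalent fully decoupled configuration passes that effective value directly:

```python
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4)  # stand-in model

# decoupled (default): effective weight decay is lr * weight_decay = 1e-5
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1)

# fully decoupled: the learning rate no longer scales the weight decay,
# so pass the target effective value directly
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-5, decouple_lr=True)
```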

Lion

Lion optimizer. Evolved Sign Momentum.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for the update moving average and gradient moving average | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If `decouple_lr` is `False`, applies decoupled weight decay | `0` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of decoupled weight decay | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr` is `True` | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (`float16` or `bfloat16`). If unspecified, automatically applies for low precision parameters | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use the foreach implementation over the for-loop implementation since it is significantly faster | `None` |
| `gradient_release` | `bool` | Fuses the optimizer step and `zero_grad` as part of the parameter's backward pass. Requires model hooks created with `register_gradient_release`. Incompatible with closures | `False` |
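
A sketch of these options in a low precision training step (the model and data are placeholders); `kahan_sum` would engage automatically for bfloat16 parameters, but it is set explicitly here for clarity:

```python
import torch
import torch.nn as nn
from optimi import Lion

model = nn.Linear(16, 4).to(dtype=torch.bfloat16)
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-1, kahan_sum=True)

x = torch.randn(8, 16, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()  # toy loss
loss.backward()
opt.step()
opt.zero_grad()
```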

Algorithm

Lion: Evolved Sign Momentum.

\[
\begin{aligned}
&\rule{100mm}{0.4pt}\\
&\hspace{2mm} \textbf{Lion} \\
&\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\
&\hspace{17.25mm} \beta_1, \beta_2 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}\\
&\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}\\[-0.5em]
&\rule{100mm}{0.4pt}\\
&\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\
&\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em]
&\hspace{10mm} \bm{u} \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\
&\hspace{10mm} \bm{m}_t \leftarrow \beta_2 \bm{m}_{t-1} + (1 - \beta_2) \bm{g}_t\\[0.5em]
&\hspace{10mm} \bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t \bigl(\text{sign}(\bm{u}) + \lambda\bm{\theta}_{t-1} \bigr)\\[-0.5em]
&\rule{100mm}{0.4pt}\\
\end{aligned}
\]

optimi’s Lion also supports fully decoupled weight decay, which is not shown.
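
As a sketch of that variant, assuming fully decoupled weight decay simply drops the learning rate from the decay term (per the footnote below), the parameter update would become:

\[
\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \gamma_t\,\text{sign}(\bm{u}) - \lambda\bm{\theta}_{t-1}
\]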


[^1]: The learning rate does not modify the effective weight decay when using fully decoupled weight decay.