Adan: ADAptive Nesterov Momentum
Adan uses an efficient Nesterov momentum estimation method to avoid the extra computation and memory overhead of calculating the extrapolation point gradient. In contrast to other optimizers that estimate Nesterov momentum, Adan estimates both the first- and second-order gradient moments. This estimation requires two additional buffers over AdamW, increasing memory usage.
Adan was introduced by Xie et al. in Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models.
Hyperparameters
Hyperparameter notes from Xie et al.:
- \(\beta_2\) is the least sensitive Adan hyperparameter; the default of 0.92 works for the majority of tasks
- Xie et al. primarily tune \(\beta_3\) (between 0.9 and 0.999) before \(\beta_1\) (between 0.9 and 0.98) for different tasks
- Adan pairs well with large learning rates. The paper and GitHub repository report learning rates up to 3x larger than Lamb and up to 5-10x larger than AdamW (see the example after these notes)
- Xie et al. use the default weight decay of 0.02 for all tasks except fine-tuning BERT (0.01) and reinforcement learning (0)
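As a concrete illustration of these notes, here is a minimal sketch of constructing Adan with the default betas; the model and learning rate are placeholders, not recommendations from the paper:

```python
from torch import nn
from optimi import Adan

model = nn.Linear(128, 10)  # placeholder model

# betas follow optimi's (beta1, beta2, beta3) order: gradient, gradient
# difference, and squared gradient moving averages. Tune beta3, then
# beta1, per the notes above; beta2's default of 0.92 rarely changes.
optimizer = Adan(
    model.parameters(),
    lr=1e-2,  # illustrative; Adan tolerates larger learning rates
    betas=(0.98, 0.92, 0.99),
    weight_decay=0.02,
)
```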
optimi’s implementation of Adan also supports fully decoupled weight decay: `decouple_lr=True`. The default weight decay of 0.02 will likely need to be reduced when using fully decoupled weight decay, as the learning rate will no longer modify the effective weight decay.
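For example, a sketch of enabling fully decoupled weight decay with a smaller coefficient (the value shown is illustrative, not a tuned recommendation):

```python
from torch import nn
from optimi import Adan

model = nn.Linear(128, 10)  # placeholder model

# With decouple_lr=True, weight decay is no longer multiplied by the
# learning rate, so the same coefficient applies a stronger effective
# decay; a smaller value is usually needed.
optimizer = Adan(
    model.parameters(),
    lr=1e-2,
    weight_decay=1e-5,  # illustrative reduced value
    decouple_lr=True,
)
```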
Adan
Adan Optimizer: Adaptive Nesterov Momentum Algorithm.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float, float]` | Coefficients for gradient, gradient difference, and squared gradient moving averages (default: (0.98, 0.92, 0.99)) | `(0.98, 0.92, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 0.02) | `0.02` |
| `eps` | `float` | Added to denominator to improve numerical stability (default: 1e-6) | `1e-06` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of decoupled weight decay (default: False) | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate (default: None) | `None` |
| `adam_wd` | `bool` | Apply weight decay before parameter update (Adam-style), instead of after the update per the Adan algorithm (default: False) | `False` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | `False` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use foreach over the for-loop implementation since it is significantly faster (default: None) | `None` |
| `gradient_release` | `bool` | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release` (default: False) | `False` |
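As a usage sketch, Adan drops into a standard PyTorch training loop like any other optimizer; the model, data, and hyperparameters below are placeholders:

```python
import torch
from torch import nn
from optimi import Adan

model = nn.Linear(20, 1)  # placeholder model
optimizer = Adan(model.parameters(), lr=1e-2, weight_decay=0.02)
loss_fn = nn.MSELoss()

for _ in range(10):
    inputs, targets = torch.randn(32, 20), torch.randn(32, 1)  # placeholder data
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```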
Note: Adan in bfloat16 is Noisier than Other Optimizers
Even with Kahan summation, training with Adan in bfloat16 results in noisier updates, relative to float32 or mixed precision training, than other optimizers do.
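If this extra noise is a concern, one alternative (a sketch assuming a CUDA device) is to keep parameters in float32 and run the forward and backward pass under bfloat16 autocast, so the optimizer state and updates stay in float32:

```python
import torch
from torch import nn
from optimi import Adan

device = "cuda"  # assumes a CUDA device is available
model = nn.Linear(20, 1).to(device)  # parameters remain float32
optimizer = Adan(model.parameters(), lr=1e-2)

inputs = torch.randn(32, 20, device=device)
targets = torch.randn(32, 1, device=device)

# bfloat16 autocast runs the forward pass in low precision while the
# optimizer step is applied to float32 parameters, avoiding the noisier
# pure-bfloat16 updates described above.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```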
Algorithm
Adan: Adaptive Nesterov Momentum.
During the first step, \(\bm{g}_t - \bm{g}_{t-1}\) is set to \(\bm{0}\).
optimi’s Adan also supports Adam-style weight decay and fully decoupled weight decay, neither of which is shown.
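For reference, a minimal single-tensor sketch of one Adan update, following the paper's equations in optimi's beta convention; it omits bias correction, Kahan summation, foreach, and the Adam-style and fully decoupled weight decay variants, so it is an illustration rather than optimi's exact implementation:

```python
import torch

def adan_step(param, grad, prev_grad, m, v, n, lr=1e-2,
              betas=(0.98, 0.92, 0.99), weight_decay=0.02, eps=1e-6):
    """Illustrative Adan update on a single tensor (no bias correction)."""
    beta1, beta2, beta3 = betas
    # g_t - g_{t-1}: zero on the first step if prev_grad is initialized
    # to the first gradient
    grad_diff = grad - prev_grad

    m.mul_(beta1).add_(grad, alpha=1 - beta1)       # gradient moving average
    v.mul_(beta2).add_(grad_diff, alpha=1 - beta2)  # gradient difference moving average
    update_grad = grad + beta2 * grad_diff
    n.mul_(beta3).addcmul_(update_grad, update_grad, value=1 - beta3)  # squared gradient moving average

    update = (m + beta2 * v) / (n.sqrt() + eps)
    param.sub_(update, alpha=lr)
    param.div_(1 + lr * weight_decay)  # Adan-style decoupled weight decay after the update
    prev_grad.copy_(grad)
```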