Adan: ADAptive Nesterov Momentum

Adan uses an efficient Nesterov momentum estimation method to avoid the extra computation and memory overhead of calculating the extrapolation point gradient. In contrast to other optimizers that estimate Nesterov momentum, Adan applies the estimation to both the first- and second-order gradient moments. This estimation requires two additional buffers over AdamW, increasing memory usage.

Adan was introduced by Xie et al. in Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models.
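
For reference, a minimal usage sketch with optimi's Adan (the model and learning rate below are placeholders, not recommendations):

```python
import torch
from optimi import Adan

# stand-in model; replace with your own
model = torch.nn.Linear(10, 1)

# Adan with optimi's defaults: betas=(0.98, 0.92, 0.99) and weight_decay=0.02
optimizer = Adan(model.parameters(), lr=1e-2)

# standard PyTorch training step
loss = model(torch.randn(8, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```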

Hyperparameters

Hyperparameter notes from Xie et al.:

  1. \(\beta_2\) is the least sensitive Adan hyperparameter; the default of 0.92 works for the majority of tasks
  2. Xie et al. primarily tune \(\beta_3\) (between 0.9 and 0.999) before \(\beta_1\) (between 0.9 and 0.98) for different tasks
  3. Adan pairs well with large learning rates: the paper and GitHub repository report learning rates up to 3x larger than Lamb and up to 5-10x larger than AdamW
  4. Xie et al. use the default weight decay of 0.02 for all tasks except fine-tuning BERT (0.01) and reinforcement learning (0)
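
Following these notes, one might tune \(\beta_3\) and use a larger learning rate than a comparable AdamW run. The values below are purely illustrative and reuse the placeholder model from the sketch above:

```python
from optimi import Adan

# illustrative values: beta_3 raised within the 0.9-0.999 range, beta_1 and beta_2
# left at their defaults, and a learning rate larger than a typical AdamW setting
optimizer = Adan(
    model.parameters(),
    lr=5e-3,
    betas=(0.98, 0.92, 0.999),
    weight_decay=0.02,
)
```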

optimi’s implementation of Adan also supports fully decoupled weight decay (`decouple_lr=True`). The default weight decay of 0.02 will likely need to be reduced when using fully decoupled weight decay, since the learning rate no longer modifies the effective weight decay.
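
A minimal sketch of fully decoupled weight decay; the reduced weight decay value is illustrative, not a recommendation:

```python
from optimi import Adan

# with decouple_lr=True the weight decay is no longer scaled by the learning rate,
# so the default of 0.02 is likely too large; 1e-5 here is purely illustrative
optimizer = Adan(
    model.parameters(),  # placeholder model from the sketch above
    lr=1e-2,
    weight_decay=1e-5,
    decouple_lr=True,
)
```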

Adan

Adan Optimizer: Adaptive Nesterov Momentum Algorithm.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| params | Iterable[Tensor] \| Iterable[dict] | Iterable of parameters to optimize or dicts defining parameter groups | required |
| lr | float | Learning rate | required |
| betas | tuple[float, float, float] | Coefficients for gradient, gradient difference, and squared gradient moving averages (default: (0.98, 0.92, 0.99)) | (0.98, 0.92, 0.99) |
| weight_decay | float | Weight decay coefficient. If decouple_lr is False, applies decoupled weight decay (default: 2e-2) | 0.02 |
| eps | float | Added to denominator to improve numerical stability (default: 1e-6) | 1e-06 |
| decouple_lr | bool | Apply fully decoupled weight decay instead of decoupled weight decay (default: False) | False |
| max_lr | float \| None | Maximum scheduled learning rate. Set if lr is not the maximum scheduled learning rate and decouple_lr is True (default: None) | None |
| adam_wd | bool | Apply weight decay before parameter update (Adam-style), instead of after the update per Adan algorithm (default: False) | False |
| kahan_sum | bool \| None | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | False |
| foreach | bool \| None | Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it is significantly faster (default: None) | None |
| gradient_release | bool | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with register_gradient_release. Incompatible with closure (default: False) | False |
Note: Adan in bfloat16 is Noisier than Other Optimizers

Even with Kahan summation, training Adan in bfloat16 results in noisier updates, relative to float32 or mixed precision training, than other optimizers.
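
A minimal sketch of low precision training with Kahan summation explicitly enabled; the bfloat16 linear model is a placeholder, and per the kahan_sum description above, leaving it unspecified applies Kahan summation for low precision parameters automatically:

```python
import torch
from optimi import Adan

# placeholder bfloat16 model
model = torch.nn.Linear(10, 1).to(dtype=torch.bfloat16)

# kahan_sum=True makes the automatic low precision behavior explicit
optimizer = Adan(model.parameters(), lr=1e-2, kahan_sum=True)
```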

Algorithm

Adan: Adaptive Nesterov Momentum.

\[ \begin{align*} &\rule{100mm}{0.4pt}\\ &\hspace{2mm} \textbf{Adan} \\ &\hspace{5mm} \text{inputs} : \bm{\theta}_0 \: \text{(params)}; \: f(\bm{\theta}) \text{(objective)}; \: \gamma_t \:\text{(learning rate at } t \text{)}; \\ &\hspace{17.25mm} \beta_1, \beta_2, \beta_3 \: \text{(betas)}; \: \lambda \: \text{(weight decay)}; \: \epsilon \: \text{(epsilon)}\\ &\hspace{5mm} \text{initialize} : \bm{m}_{0} \leftarrow \bm{0}; \: \bm{v}_{0} \leftarrow \bm{0}; \: \bm{n}_{0} \leftarrow \bm{0}\\[-0.5em] &\rule{100mm}{0.4pt}\\ &\hspace{5mm} \textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}\text{:}\\ &\hspace{10mm} \bm{g}_t \leftarrow \nabla_{\theta} f_t(\bm{\theta}_{t-1})\\[0.5em] &\hspace{10mm} \bm{m}_t \leftarrow \beta_1 \bm{m}_{t-1} + (1 - \beta_1) \bm{g}_t\\ &\hspace{10mm} \bm{v}_t \leftarrow \beta_2 \bm{v}_{t-1} + (1 - \beta_2) (\bm{g}_t - \bm{g}_{t-1})\\ &\hspace{10mm} \bm{n}_t \leftarrow \beta_3 \bm{n}_{t-1} + (1 - \beta_3)\bigl(\bm{g}_t + \beta_2(\bm{g}_t - \bm{g}_{t-1})\bigr)^2\\[0.5em] &\hspace{10mm} \hat{\bm{m}}_t \leftarrow \bm{m}_t/(1 - \beta_1^t)\\ &\hspace{10mm} \hat{\bm{v}}_t \leftarrow \bm{v}_t/(1 - \beta_2^t)\\ &\hspace{10mm} \hat{\bm{n}}_t \leftarrow \bm{n}_t/(1 - \beta_3^t)\\[0.5em] &\hspace{10mm} \bm{\eta}_t \leftarrow \gamma_t/(\sqrt{\hat{\bm{n}}_t} + \epsilon)\\ &\hspace{10mm} \bm{\theta}_t \leftarrow (1+\gamma_t\lambda )^{-1}\bigl(\bm{\theta}_{t-1} - \bm{\eta}_t (\hat{\bm{m}}_t + \beta_2\hat{\bm{v}}_t)\bigr)\\[-0.5em] &\rule{100mm}{0.4pt}\\ \end{align*} \]

During the first step, \(\bm{g}_t - \bm{g}_{t-1}\) is set to \(\bm{0}\).
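
As a concrete reading of the update rule, here is a minimal single-tensor sketch of one Adan step with decoupled weight decay as written above. It is not optimi’s implementation, which also handles parameter groups, foreach, Kahan summation, and the other weight decay modes:

```python
import torch

@torch.no_grad()
def adan_step(param, grad, prev_grad, m, v, n, step, lr,
              betas=(0.98, 0.92, 0.99), weight_decay=0.02, eps=1e-6):
    """One Adan update on a single tensor. m, v, and n are state buffers
    initialized to zeros; step counts from 1."""
    beta1, beta2, beta3 = betas

    # g_t - g_{t-1}, defined as 0 on the first step
    grad_diff = grad - prev_grad if step > 1 else torch.zeros_like(grad)
    nesterov_grad = grad + beta2 * grad_diff

    # gradient, gradient difference, and squared gradient moving averages
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).add_(grad_diff, alpha=1 - beta2)
    n.mul_(beta3).addcmul_(nesterov_grad, nesterov_grad, value=1 - beta3)

    # bias corrections
    m_hat = m / (1 - beta1**step)
    v_hat = v / (1 - beta2**step)
    n_hat = n / (1 - beta3**step)

    # adaptive step size, parameter update, then decoupled weight decay
    eta = lr / (n_hat.sqrt() + eps)
    param.sub_(eta * (m_hat + beta2 * v_hat))
    param.div_(1 + lr * weight_decay)
```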

optimi’s Adan also supports Adam-style weight decay and fully decoupled weight decay, neither of which is shown above.