Adan: ADAptive Nesterov Momentum
Adan uses an efficient Nesterov momentum estimation method to avoid the extra computation and memory overhead of calculating the extrapolation point gradient. In contrast to other optimizers that estimate Nesterov momentum, Adan estimates both the first- and second-order gradient moments. This estimation requires two additional buffers over AdamW, increasing memory usage.
Adan was introduced by Xie et al. in Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models.
Hyperparameters
Hyperparameter notes from Xie et al.:
- \(\beta_2\) is the least sensitive Adan hyperparameter; the default of 0.92 works for the majority of tasks
- Xie et al. primarily tune \(\beta_3\) (between 0.9 and 0.999) before \(\beta_1\) (between 0.9 and 0.98) for different tasks
- Adan pairs well with large learning rates. The paper and GitHub repository report learning rates up to 3x larger than Lamb and up to 5-10x larger than AdamW (see the example after these notes)
- Xie et al. use the default weight decay of 0.02 for all tasks except fine-tuning BERT (0.01) and reinforcement learning (0)
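As a concrete illustration of these notes, here is a minimal sketch of constructing Adan with the default betas; the model and learning rate are placeholders, not recommendations from the paper:

```python
from torch import nn
from optimi import Adan

model = nn.Linear(128, 10)  # placeholder model

# betas follow optimi's (beta1, beta2, beta3) order: gradient, gradient
# difference, and squared gradient moving averages. Tune beta3, then
# beta1, per the notes above; beta2's default of 0.92 rarely changes.
optimizer = Adan(
    model.parameters(),
    lr=1e-2,  # illustrative; Adan tolerates larger learning rates
    betas=(0.98, 0.92, 0.99),
    weight_decay=0.02,
)
```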
optimi’s implementation of Adan also supports fully decoupled weight decay: `decouple_lr=True`. The default weight decay of 0.02 will likely need to be reduced when using fully decoupled weight decay, as the learning rate will no longer modify the effective weight decay.
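For example, a sketch of enabling fully decoupled weight decay with a smaller coefficient (the value shown is illustrative, not a tuned recommendation):

```python
from torch import nn
from optimi import Adan

model = nn.Linear(128, 10)  # placeholder model

# With decouple_lr=True, weight decay is no longer multiplied by the
# learning rate, so the same coefficient applies a stronger effective
# decay; a smaller value is usually needed.
optimizer = Adan(
    model.parameters(),
    lr=1e-2,
    weight_decay=1e-5,  # illustrative reduced value
    decouple_lr=True,
)
```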
Adan
Adan Optimizer: Adaptive Nesterov Momentum Algorithm.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float, float]` | Coefficients for gradient, gradient difference, and squared gradient moving averages (default: (0.98, 0.92, 0.99)) | `(0.98, 0.92, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If `decouple_lr` is False, applies decoupled weight decay (default: 0.02) | `0.02` |
| `eps` | `float` | Added to denominator to improve numerical stability (default: 1e-6) | `1e-06` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of decoupled weight decay (default: False) | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate (default: None) | `None` |
| `adam_wd` | `bool` | Apply weight decay before parameter update (Adam-style), instead of after the update per the Adan algorithm (default: False) | `False` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | `False` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use foreach over the for-loop implementation since it is significantly faster (default: None) | `None` |
| `gradient_release` | `bool` | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release` (default: False) | `False` |
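As a usage sketch, Adan drops into a standard PyTorch training loop like any other optimizer; the model, data, and hyperparameters below are placeholders:

```python
import torch
from torch import nn
from optimi import Adan

model = nn.Linear(20, 1)  # placeholder model
optimizer = Adan(model.parameters(), lr=1e-2, weight_decay=0.02)
loss_fn = nn.MSELoss()

for _ in range(10):
    inputs, targets = torch.randn(32, 20), torch.randn(32, 1)  # placeholder data
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```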
Note: Adan in bfloat16 is Noisier than Other Optimizers
Even with Kahan summation, training with Adan in bfloat16 results in noisier updates, relative to float32 or mixed precision training, than other optimizers do.
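If this extra noise is a concern, one alternative (a sketch assuming a CUDA device) is to keep parameters in float32 and run the forward and backward pass under bfloat16 autocast, so the optimizer state and updates stay in float32:

```python
import torch
from torch import nn
from optimi import Adan

device = "cuda"  # assumes a CUDA device is available
model = nn.Linear(20, 1).to(device)  # parameters remain float32
optimizer = Adan(model.parameters(), lr=1e-2)

inputs = torch.randn(32, 20, device=device)
targets = torch.randn(32, 1, device=device)

# bfloat16 autocast runs the forward pass in low precision while the
# optimizer step is applied to float32 parameters, avoiding the noisier
# pure-bfloat16 updates described above.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```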
Algorithm
Adan: Adaptive Nesterov Momentum.
During the first step, \(\bm{g}_t - \bm{g}_{t-1}\) is set to \(\bm{0}\).
optimi’s Adan also supports Adam-style weight decay and fully decoupled weight decay, neither of which is shown.
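For reference, a minimal single-tensor sketch of one Adan update, following the paper's equations in optimi's beta convention; it omits bias correction, Kahan summation, foreach, and the Adam-style and fully decoupled weight decay variants, so it is an illustration rather than optimi's exact implementation:

```python
import torch

def adan_step(param, grad, prev_grad, m, v, n, lr=1e-2,
              betas=(0.98, 0.92, 0.99), weight_decay=0.02, eps=1e-6):
    """Illustrative Adan update on a single tensor (no bias correction)."""
    beta1, beta2, beta3 = betas
    # g_t - g_{t-1}: zero on the first step if prev_grad is initialized
    # to the first gradient
    grad_diff = grad - prev_grad

    m.mul_(beta1).add_(grad, alpha=1 - beta1)       # gradient moving average
    v.mul_(beta2).add_(grad_diff, alpha=1 - beta2)  # gradient difference moving average
    update_grad = grad + beta2 * grad_diff
    n.mul_(beta3).addcmul_(update_grad, update_grad, value=1 - beta3)  # squared gradient moving average

    update = (m + beta2 * v) / (n.sqrt() + eps)
    param.sub_(update, alpha=lr)
    param.div_(1 + lr * weight_decay)  # Adan-style decoupled weight decay after the update
    prev_grad.copy_(grad)
```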