Adam: Adaptive Moment Estimation¶
Adam (Adaptive Moment Estimation) computes per-parameter adaptive learning rates from the first and second gradient moments. Adam combines the advantages of two other optimizers: AdaGrad, which adapts the learning rate to the parameters, and RMSProp, which uses a moving average of squared gradients to set per-parameter learning rates. Adam also introduces bias-corrected estimates of the first and second gradient averages.
Adam was introduced by Diederik Kingma and Jimmy Ba in Adam: A Method for Stochastic Optimization.
Hyperparameters¶
optimi sets the default \(\beta\)s to (0.9, 0.99) and the default \(\epsilon\) to 1e-6. These values reflect current best practices and usually outperform the PyTorch defaults.
If training on large batch sizes or observing training loss spikes, consider reducing \(\beta_2\) to a value in \([0.95, 0.99)\).
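For example, a minimal sketch of constructing optimi's Adam with these defaults, and with a lower \(\beta_2\) for large batch training (the `nn.Linear` model is a stand-in for a real model):

```python
from torch import nn
from optimi import Adam

model = nn.Linear(20, 1)

# optimi defaults: betas=(0.9, 0.99) and eps=1e-6
opt = Adam(model.parameters(), lr=1e-3)

# large batch sizes or training loss spikes: lower beta2 into [0.95, 0.99)
opt = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
```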
optimi’s implementation of Adam combines Adam with both AdamW (`decouple_wd=True`) and Adam with fully decoupled weight decay (`decouple_lr=True`). Weight decay will likely need to be reduced when using fully decoupled weight decay, as the learning rate will not modify the effective weight decay.
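As a sketch, the three weight decay modes look like this (the weight decay values are illustrative only, not tuned recommendations):

```python
from torch import nn
from optimi import Adam

model = nn.Linear(20, 1)

# L2 penalty added to the gradient (default behavior)
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# decoupled weight decay (AdamW behavior)
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-2, decouple_wd=True)

# fully decoupled weight decay: weight decay is not scaled by the learning
# rate, so a much smaller value is usually appropriate
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True)
```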
Adam¶
Adam optimizer. Optionally with decoupled weight decay (AdamW).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99)) | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If both `decouple_wd` and `decouple_lr` are False, weight decay is applied as an L2 penalty (default: 0) | `0` |
| `eps` | `float` | Added to denominator to improve numerical stability (default: 1e-6) | `1e-06` |
| `decouple_wd` | `bool` | Apply decoupled weight decay instead of L2 penalty (default: False) | `False` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of L2 penalty (default: False) | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate (default: None) | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use foreach over the for-loop implementation since it is significantly faster (default: None) | `None` |
| `gradient_release` | `bool` | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release` (default: False) | `False` |
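For example, a minimal low precision training step, assuming a placeholder bfloat16 model and random input standing in for real data; with `kahan_sum` left at `None`, Kahan summation is enabled automatically for the bfloat16 parameters:

```python
import torch
from torch import nn
from optimi import Adam

# placeholder model cast to bfloat16; Kahan summation is applied
# automatically because the parameters are low precision
model = nn.Linear(20, 1, dtype=torch.bfloat16)
opt = Adam(model.parameters(), lr=1e-3)

# one training step with random data standing in for a real batch and loss
loss = model(torch.randn(20, dtype=torch.bfloat16)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```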
Algorithm¶
Adam with L2 regularization.
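A sketch of the standard update (notation introduced here: \(\theta\) are the parameters, \(g_t\) the gradient, \(m_t\) and \(v_t\) the moving averages, \(\alpha\) the learning rate, and \(\lambda\) the L2 weight decay; the implementation may differ in minor details such as where \(\epsilon\) is applied):

$$
\begin{aligned}
g_t &= \nabla_{\theta} f_t(\theta_{t-1}) + \lambda \theta_{t-1} \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
$$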
optimi’s Adam also supports AdamW’s decoupled weight decay and fully decoupled weight decay, which are not shown.