Adam: Adaptive Moment Estimation¶
Adam (Adaptive Moment Estimation) computes per-parameter adaptive learning rates from the first and second gradient moments. Adam combines the advantages of two other optimizers: AdaGrad, which adapts the learning rate to the parameters, and RMSProp, which uses a moving average of squared gradients to set per-parameter learning rates. Adam also introduces bias-corrected estimates of the first and second gradient averages.
Adam was introduced by Diederik Kingma and Jimmy Ba in Adam: A Method for Stochastic Optimization.
Hyperparameters¶
optimi sets the default \(\beta\)s to (0.9, 0.99) and the default \(\epsilon\) to 1e-6. These values reflect current best practices and usually outperform the PyTorch defaults.
If training on large batch sizes or observing training loss spikes, consider reducing \(\beta_2\) to a value in \([0.95, 0.99)\).
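For example, a minimal sketch of constructing optimi's Adam with these defaults, and with a lower \(\beta_2\) for large batch training (the `nn.Linear` model is a stand-in for a real model):

```python
from torch import nn
from optimi import Adam

model = nn.Linear(20, 1)

# optimi defaults: betas=(0.9, 0.99) and eps=1e-6
opt = Adam(model.parameters(), lr=1e-3)

# large batch sizes or training loss spikes: lower beta2 into [0.95, 0.99)
opt = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
```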
optimi’s implementation of Adam combines Adam with both AdamW (`decouple_wd=True`) and Adam with fully decoupled weight decay (`decouple_lr=True`). Weight decay will likely need to be reduced when using fully decoupled weight decay, as the learning rate will not modify the effective weight decay.
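As a sketch, the three weight decay modes look like this (the weight decay values are illustrative only, not tuned recommendations):

```python
from torch import nn
from optimi import Adam

model = nn.Linear(20, 1)

# L2 penalty added to the gradient (default behavior)
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# decoupled weight decay (AdamW behavior)
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-2, decouple_wd=True)

# fully decoupled weight decay: weight decay is not scaled by the learning
# rate, so a much smaller value is usually appropriate
opt = Adam(model.parameters(), lr=1e-3, weight_decay=1e-5, decouple_lr=True)
```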
Adam¶
Adam optimizer. Optionally with decoupled weight decay (AdamW).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99)) | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If both `decouple_wd` and `decouple_lr` are False, weight decay is applied as an L2 penalty (default: 0) | `0` |
| `eps` | `float` | Added to denominator to improve numerical stability (default: 1e-6) | `1e-06` |
| `decouple_wd` | `bool` | Apply decoupled weight decay instead of L2 penalty (default: False) | `False` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of L2 penalty (default: False) | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate (default: None) | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use foreach over the for-loop implementation since it is significantly faster (default: None) | `None` |
| `gradient_release` | `bool` | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release` (default: False) | `False` |
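For example, a minimal low precision training step, assuming a placeholder bfloat16 model and random input standing in for real data; with `kahan_sum` left at `None`, Kahan summation is enabled automatically for the bfloat16 parameters:

```python
import torch
from torch import nn
from optimi import Adam

# placeholder model cast to bfloat16; Kahan summation is applied
# automatically because the parameters are low precision
model = nn.Linear(20, 1, dtype=torch.bfloat16)
opt = Adam(model.parameters(), lr=1e-3)

# one training step with random data standing in for a real batch and loss
loss = model(torch.randn(20, dtype=torch.bfloat16)).sum()
loss.backward()
opt.step()
opt.zero_grad()
```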
Algorithm¶
Adam with L2 regularization.
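A sketch of the standard update (notation introduced here: \(\theta\) are the parameters, \(g_t\) the gradient, \(m_t\) and \(v_t\) the moving averages, \(\alpha\) the learning rate, and \(\lambda\) the L2 weight decay; the implementation may differ in minor details such as where \(\epsilon\) is applied):

$$
\begin{aligned}
g_t &= \nabla_{\theta} f_t(\theta_{t-1}) + \lambda \theta_{t-1} \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
$$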
optimi’s Adam also supports AdamW’s decoupled weight decay and fully decoupled weight decay, which are not shown.