Ranger: RAdam and LookAhead¶
Ranger combines RAdam and Lookahead in one optimizer. RAdam fixes the adaptive learning rate's large variance during the early stages of training, improving convergence and reducing the need for a learning rate warmup. Lookahead updates the model weights as usual, but every k steps interpolates them with a copy of slow-moving weights. This moving average of the model weights is less sensitive to suboptimal hyperparameters and reduces the need for hyperparameter tuning.
Ranger was introduced by Less Wright in *New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + Lookahead for the best of both*.
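For intuition, here is a minimal standalone sketch of the Lookahead interpolation step, not optimi's internal implementation; `lookahead_sync` and the slow/fast parameter names are illustrative:

```python
import torch

@torch.no_grad()
def lookahead_sync(fast_params, slow_params, alpha=0.5):
    """Every k optimizer steps: slow <- slow + alpha * (fast - slow), then fast <- slow."""
    for fast, slow in zip(fast_params, slow_params):
        slow.lerp_(fast, alpha)  # move the slow weights toward the fast weights
        fast.copy_(slow)         # restart the fast weights from the updated slow weights
```

In the combined optimizer, the fast weights are the ones RAdam updates every step, while the slow weights only change at each synchronization.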
Hyperparameters¶
Ranger works best with a flat learning rate followed by a short learning rate decay. Try a learning rate two to three times larger than you would use with AdamW.
optimi sets the default \(\beta\)s to `(0.9, 0.99)` and the default \(\epsilon\) to `1e-6`. These values reflect current best practices and usually outperform the PyTorch defaults.
optimi’s implementation of Ranger supports both decoupled weight decay (`decouple_wd=True`) and fully decoupled weight decay (`decouple_lr=True`). Weight decay will likely need to be reduced when using fully decoupled weight decay, as the learning rate no longer modifies the effective weight decay.
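For example, a setup following these recommendations might look like the following; the hyperparameter values are illustrative, not tuned:

```python
import torch
from optimi import Ranger

model = torch.nn.Linear(128, 10)

# Flat learning rate, roughly 2-3x larger than an equivalent AdamW run,
# with optimi's default betas and eps written out explicitly
opt = Ranger(model.parameters(), lr=3e-3, betas=(0.9, 0.99), eps=1e-6, weight_decay=1e-2)

# With fully decoupled weight decay the learning rate no longer scales the
# effective weight decay, so use a smaller weight_decay value
opt_fully_decoupled = Ranger(model.parameters(), lr=3e-3, weight_decay=1e-5, decouple_lr=True)
```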
Ranger¶
Ranger optimizer. RAdam with Lookahead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `params` | `Iterable[Tensor] \| Iterable[dict]` | Iterable of parameters to optimize or dicts defining parameter groups | *required* |
| `lr` | `float` | Learning rate | *required* |
| `betas` | `tuple[float, float]` | Coefficients for gradient and squared gradient moving averages | `(0.9, 0.99)` |
| `weight_decay` | `float` | Weight decay coefficient. If neither `decouple_wd` nor `decouple_lr` is True, applies L2 penalty | `0` |
| `eps` | `float` | Added to denominator to improve numerical stability | `1e-6` |
| `k` | `int` | Lookahead synchronization period | `6` |
| `alpha` | `float` | Lookahead weight interpolation coefficient | `0.5` |
| `decouple_wd` | `bool` | Apply decoupled weight decay instead of L2 penalty | `True` |
| `decouple_lr` | `bool` | Apply fully decoupled weight decay instead of L2 penalty | `False` |
| `max_lr` | `float \| None` | Maximum scheduled learning rate. Set if `lr` is not the maximum scheduled learning rate and `decouple_lr=True` | `None` |
| `kahan_sum` | `bool \| None` | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters | `None` |
| `foreach` | `bool \| None` | Enables the foreach implementation. If unspecified, tries to use the foreach over the for-loop implementation since it is significantly faster | `None` |
| `gradient_release` | `bool` | Fuses the optimizer step and `zero_grad` as part of the parameter's backward pass. Requires model hooks created with `prepare_for_gradient_release` | `False` |
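As an illustration of the low precision options, a short sketch (model and values are illustrative):

```python
import torch
from optimi import Ranger

model = torch.nn.Linear(128, 10, dtype=torch.bfloat16)

# kahan_sum=None (the default) enables Kahan summation automatically for
# low precision (float16/bfloat16) parameters
opt = Ranger(model.parameters(), lr=3e-3)

# pass kahan_sum=False to opt out and use the standard update instead
opt_no_kahan = Ranger(model.parameters(), lr=3e-3, kahan_sum=False)
```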
Algorithm¶
Ranger: RAdam and LookAhead.
optimi’s Ranger also supports L2 regularization and fully decoupled weight decay, which are not shown.
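For reference, a sketch of the decoupled weight decay form of the update, following the RAdam and Lookahead papers, where \(\gamma\) is the learning rate and \(\lambda\) the weight decay coefficient. The variance-tractability threshold is 4 in the RAdam paper (some implementations use 5), and optimi's implementation may order these operations slightly differently:

$$
\begin{aligned}
&\rho_\infty = \tfrac{2}{1-\beta_2} - 1, \quad m_0 = v_0 = 0, \quad \phi_0 = \theta_0\\
&\textbf{for } t = 1, 2, \ldots\\
&\quad g_t = \nabla_\theta f_t(\theta_{t-1})\\
&\quad m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2\\
&\quad \hat{m}_t = m_t / (1-\beta_1^t), \qquad \rho_t = \rho_\infty - 2t\beta_2^t / (1-\beta_2^t)\\
&\quad \theta_t \leftarrow \theta_{t-1} - \gamma\lambda\, \theta_{t-1} \qquad \text{(decoupled weight decay)}\\
&\quad \textbf{if } \rho_t > 4: \quad r_t = \sqrt{\tfrac{(\rho_t-4)(\rho_t-2)\,\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\,\rho_t}}, \quad \hat{v}_t = \sqrt{v_t / (1-\beta_2^t)}, \quad \theta_t \leftarrow \theta_t - \gamma\, r_t\, \hat{m}_t / (\hat{v}_t + \epsilon)\\
&\quad \textbf{else}: \quad \theta_t \leftarrow \theta_t - \gamma\, \hat{m}_t\\
&\quad \textbf{if } t \bmod k = 0: \quad \phi_t = \phi_{t-k} + \alpha\,(\theta_t - \phi_{t-k}), \quad \theta_t \leftarrow \phi_t \qquad \text{(Lookahead synchronization)}
\end{aligned}
$$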