
Which Optimizer Should I Use?

This guide provides a quick overview for choosing an optimi optimizer.

All optimi optimizers support training in pure BFloat16 precision [1] using Kahan summation, which can match the results of Float32 and mixed precision training while reducing memory usage and increasing training speed.
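
As a concrete sketch, pure BFloat16 training looks like standard training with a BFloat16 model and an optimi optimizer. This assumes optimi's AdamW follows the PyTorch optimizer signature and exposes a `kahan_sum` flag; verify both against the optimi docs.

```python
import torch
from torch import nn
from optimi import AdamW  # assumed import path for the optimi library

# Model and data kept entirely in BFloat16 -- no Float32 master weights.
model = nn.Linear(256, 256, dtype=torch.bfloat16)

# kahan_sum is assumed to be the flag controlling Kahan summation; optimi
# enables it for low precision parameters, shown explicitly here.
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2, kahan_sum=True)

x = torch.randn(32, 256, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```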

Tried and True

There’s a reason AdamW is the default optimizer of deep learning. It performs well across multiple domains, model architectures, and batch sizes. Most optimizers claiming to outperform AdamW fail to do so under careful analysis and experimentation.

Consider reducing the \(\beta_2\) term if training on large batch sizes or observing training loss spikes [2].
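
For example, a minimal sketch of lowering \(\beta_2\) with optimi's AdamW, assuming it accepts a PyTorch-style `betas` tuple; the values shown are illustrative, not recommendations:

```python
import torch
from optimi import AdamW  # assumed import path

model = torch.nn.Linear(16, 16)

# Lower beta_2 (e.g. 0.99 -> 0.95) to shorten the second moment's memory,
# which can reduce loss spikes at large batch sizes.
optimizer = AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=1e-2)
```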

Drop-in Replacement

If you use gradient clipping during training or experience training loss spikes, try replacing AdamW with StableAdamW. StableAdamW applies AdaFactor-style update clipping to AdamW, stabilizing training loss and removing the need for gradient clipping.

StableAdamW can outperform AdamW with gradient clipping on downstream tasks.
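
A sketch of the swap, assuming StableAdamW shares AdamW's constructor arguments:

```python
import torch
from optimi import StableAdamW  # assumed import path

model = torch.nn.Linear(16, 16)

# Same hyperparameters as the AdamW run it replaces.
optimizer = StableAdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Because StableAdamW clips the update itself, a gradient clipping call such as
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# before optimizer.step() can be removed from the training loop.
```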

Low Memory Usage

If optimizer memory usage is a concern and pure BFloat16 training with Kahan summation doesn’t free enough memory, try optimi’s two low-memory optimizers: Lion and SGD.

Lion uses one memory buffer for both momentum and the update step, reducing memory usage compared to AdamW. While results in the literature are mixed, Lion can match AdamW in some training scenarios.

Prior to Adam and AdamW, SGD was the default optimizer for deep learning. SGD with momentum can match or outperform AdamW on some tasks but can require more hyperparameter tuning. Consider using SGD with decoupled weight decay; it can lead to better results than L2 regularization.
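
A sketch of both low-memory options. The `decouple_wd` flag for SGD's decoupled weight decay is an assumption; check the optimi docs for the exact argument name.

```python
import torch
from optimi import Lion, SGD  # assumed import path

model = torch.nn.Linear(16, 16)

# Lion: a single momentum buffer instead of AdamW's two state buffers.
lion = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

# SGD with momentum and decoupled weight decay instead of L2 regularization
# (decouple_wd is an assumed flag name).
sgd = SGD(model.parameters(), lr=1e-2, momentum=0.9,
          weight_decay=1e-5, decouple_wd=True)
```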

Potential Upgrade

Adan can outperform AdamW at the expense of extra memory usage, since it uses two more buffers than AdamW. Consider trying Adan if optimizer memory usage isn’t a priority, or when finetuning.
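
A minimal sketch, assuming optimi's Adan follows the same constructor pattern; betas and other arguments are left at the library defaults.

```python
import torch
from optimi import Adan  # assumed import path

model = torch.nn.Linear(16, 16)

# Adan keeps two extra state buffers versus AdamW, trading memory for
# potentially better convergence.
optimizer = Adan(model.parameters(), lr=1e-3, weight_decay=2e-2)
```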

Small Batch CNN

Ranger can outperform AdamW when training or finetuning convolutional neural networks at small batch sizes (roughly 512 or less). It uses one more buffer than AdamW. Ranger performs best with a flat learning rate followed by a short learning rate decay.
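
A sketch of Ranger with a flat learning rate followed by a short cosine decay. The 75/25 split is illustrative, not a library default, and Ranger is assumed to accept PyTorch-style arguments.

```python
import math
import torch
from optimi import Ranger  # assumed import path
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 16, kernel_size=3)
optimizer = Ranger(model.parameters(), lr=1e-3, weight_decay=1e-2)

total_steps = 10_000
flat_steps = 7_500  # hold the learning rate flat for ~75% of training

def flat_then_cosine(step: int) -> float:
    """Multiplier: 1.0 during the flat phase, cosine decay to zero afterwards."""
    if step < flat_steps:
        return 1.0
    progress = (step - flat_steps) / (total_steps - flat_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Call scheduler.step() once per optimizer step in the training loop.
scheduler = LambdaLR(optimizer, lr_lambda=flat_then_cosine)
```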


[1] Or BFloat16 with normalization and RoPE layers in Float32.

[2] This setting is mentioned in Sigmoid Loss for Language Image Pre-Training, although it is common knowledge in parts of the deep learning community.