ForEach Optimizer Implementations

Like PyTorch, optimi supports foreach implementations of all optimizers. Foreach optimizers can be significantly faster than the for-loop versions.

Foreach implementations can increase an optimizer's peak memory usage. optimi attempts to reduce this extra overhead by reusing the gradient buffer for temporary variables. If gradients are needed between the optimizer step and zeroing the gradients, set foreach=False to use the for-loop implementation.
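
For example, the sketch below uses foreach=False so the gradients remain intact after opt.step. This is a minimal sketch assuming a CUDA device is available; the gradient-norm logging is purely illustrative and not part of optimi's API.

import torch
from torch import nn
from optimi import AdamW

model = nn.Linear(20, 1, device='cuda')

# use the for-loop implementation so gradients are not reused as temporary buffers
opt = AdamW(model.parameters(), lr=1e-3, foreach=False)

loss = model(torch.randn(20, device='cuda'))
loss.backward()
opt.step()

# gradients are still valid here, e.g. for logging their norms (illustrative)
grad_norm = torch.linalg.vector_norm(model.weight.grad)

opt.zero_grad()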

Note: Foreach Requires PyTorch 2.1+

optimi’s foreach implementations require PyTorch 2.1 or newer.

If foreach is unspecified (foreach=None), optimi will use the foreach implementation when training on a CUDA device.

Example

Using a foreach implementation is as simple as passing foreach=True when creating the optimizer:

import torch
from torch import nn
from optimi import AdamW

# create model
model = nn.Linear(20, 1, device='cuda')

# initialize any optimi optimizer with `foreach=True`
# models on a CUDA device will default to `foreach=True`
opt = AdamW(model.parameters(), lr=1e-3, foreach=True)

# forward and backward
loss = model(torch.randn(20, device='cuda'))
loss.backward()

# optimizer step is the foreach implementation
opt.step()
opt.zero_grad()