Quick Start
We use ResNet-50 as an example to demonstrate how to accelerate the training with TorchAcc.
Torch Native Task
Below is the code for an ResNet-50 in Torch:
import time
import torch
import torchvision
batch_size = 64
log_steps = 20
inputs = torch.randn(6400, 3, 224, 224)
labels = torch.randint(0, 100, (6400,))
dataset = torch.utils.data.TensorDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4)
model = torchvision.models.resnet50()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 100)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(4):
model.train()
start_time = time.time()
for i, (inputs, labels) in enumerate(train_loader, 1):
inputs, labels = inputs.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
if i % log_steps == 0:
iteration_time = time.time() - start_time
throughputs = batch_size * log_steps / iteration_time
print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
start_time = time.time()
The results on an 80G A100 are as follows:
$ python resnet_native.py
Epoch: 0, Step: 20, Loss: 5.0722, Throughputs: 146.9960 samples/s
Epoch: 0, Step: 40, Loss: 5.6211, Throughputs: 933.5072 samples/s
Epoch: 0, Step: 60, Loss: 5.0938, Throughputs: 933.4178 samples/s
Epoch: 0, Step: 80, Loss: 4.8861, Throughputs: 931.6142 samples/s
Epoch: 0, Step: 100, Loss: 4.8116, Throughputs: 927.7186 samples/s
Epoch: 1, Step: 20, Loss: 4.6499, Throughputs: 777.6132 samples/s
Epoch: 1, Step: 40, Loss: 4.7558, Throughputs: 929.3011 samples/s
Epoch: 1, Step: 60, Loss: 4.6438, Throughputs: 923.1462 samples/s
Epoch: 1, Step: 80, Loss: 4.6413, Throughputs: 933.8570 samples/s
Epoch: 1, Step: 100, Loss: 4.7834, Throughputs: 934.5286 samples/s
Single GPU Acceleration with TorchAcc
By modifying 3 lines of code, you can call TorchAcc’s accelerate interface to accelerate model training:
import time
import torch
+ import torchacc
import torchvision
batch_size = 64
log_steps = 20
inputs = torch.randn(6400, 3, 224, 224)
labels = torch.randint(0, 100, (6400,))
dataset = torch.utils.data.TensorDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4)
model = torchvision.models.resnet50()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 100)
- device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
- model = model.to(device)
+ model, train_loader = torchacc.accelerate(model, train_loader)
+ device = model.device
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(2):
model.train()
start_time = time.time()
for i, (inputs, labels) in enumerate(train_loader, 1):
- inputs, labels = inputs.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
if i % log_steps == 0:
iteration_time = time.time() - start_time
throughputs = batch_size * log_steps / iteration_time
print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
start_time = time.time()
The main changes include:
Remove the original
model to devicelogic, and usetorchacc.accelerateto wrap the model and dataloader;Remove
batch to device, as the dataloader wrapper will automatically handle it.
The results with TorchAcc enabled are as follows:
$ python resnet_acc.py
Epoch: 0, Step: 20, Loss: 5.3560, Throughputs: 38.8063 samples/s
Epoch: 0, Step: 40, Loss: 4.8361, Throughputs: 1144.4567 samples/s
Epoch: 0, Step: 60, Loss: 4.8088, Throughputs: 1141.0203 samples/s
Epoch: 0, Step: 80, Loss: 4.6511, Throughputs: 1131.4401 samples/s
Epoch: 0, Step: 100, Loss: 4.6082, Throughputs: 1191.8075 samples/s
Epoch: 1, Step: 20, Loss: 4.6110, Throughputs: 722.4752 samples/s
Epoch: 1, Step: 40, Loss: 4.6275, Throughputs: 1150.5421 samples/s
Epoch: 1, Step: 60, Loss: 4.6875, Throughputs: 1163.3742 samples/s
Epoch: 1, Step: 80, Loss: 4.6159, Throughputs: 1176.7777 samples/s
Epoch: 1, Step: 100, Loss: 4.6067, Throughputs: 1184.0109 samples/s
It can be observed that the model undergoes compilation optimization at the start of training. After the compilation is completed, the average iteration throughput shows a 16% improvement compared to the native Torch (average throughput improvement after step 40).
Multiple GPUs Acceleration with TorchAcc
Data Parallel
No modifications needed, just replace the execution command with torchrun. TorchAcc will automatically detect the number of GPUs and default to data parallel training.
To better and more accurately view the output information, we can have only rank 0 print the output and multiply the throughputs by the number of GPUs to calculate the global throughputs. The code can be modified as follows:
import time
import torch
import torchacc
import torchvision
batch_size = 64
log_steps = 20
inputs = torch.randn(6400, 3, 224, 224)
labels = torch.randint(0, 100, (6400,))
dataset = torch.utils.data.TensorDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4)
model = torchvision.models.resnet50()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 100)
model, train_loader = torchacc.accelerate(model, train_loader)
device = model.device
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(2):
model.train()
start_time = time.time()
for i, (inputs, labels) in enumerate(train_loader, 1):
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
- if i % log_steps == 0:
+ if i % log_steps == 0 and torch.distributed.get_rank() == 0:
iteration_time = time.time() - start_time
- throughputs = batch_size * log_steps / iteration_time
+ throughputs = batch_size * log_steps / iteration_time * torch.distributed.get_world_size()
print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
start_time = time.time()
$ torchrun --nproc_per_node=4 resnet_acc.py
Epoch: 0, Step: 20, Loss: 4.8700, Throughputs: 311.2281 samples/s
Epoch: 0, Step: 40, Loss: 4.7842, Throughputs: 4359.9072 samples/s
Epoch: 0, Step: 60, Loss: 4.6687, Throughputs: 4370.7357 samples/s
Epoch: 0, Step: 80, Loss: 4.6690, Throughputs: 4385.6592 samples/s
Epoch: 0, Step: 100, Loss: 4.7295, Throughputs: 4540.9381 samples/s
Epoch: 1, Step: 20, Loss: 4.6978, Throughputs: 2852.0285 samples/s
Epoch: 1, Step: 40, Loss: 4.6760, Throughputs: 4378.8057 samples/s
Epoch: 1, Step: 60, Loss: 4.6696, Throughputs: 3888.6148 samples/s
Epoch: 1, Step: 80, Loss: 4.6757, Throughputs: 4347.2658 samples/s
Epoch: 1, Step: 100, Loss: 4.6497, Throughputs: 4421.5528 samples/s
FSDP (Fully Sharded Data Paarlell)
You only need to configure the TorchAcc Config and pass it to the torchacc.accelerate function to easily achieve FSDP training.
import time
import torch
import torchacc
import torchvision
batch_size = 64
log_steps = 20
inputs = torch.randn(6400, 3, 224, 224)
labels = torch.randint(0, 100, (6400,))
dataset = torch.utils.data.TensorDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(
dataset, batch_size=batch_size, shuffle=True, num_workers=4)
model = torchvision.models.resnet50()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 100)
+ config = torchacc.Config()
+ config.dist.fsdp.size = 4
+ config.dist.fsdp.wrap_layer_cls = {"Bottleneck"}
+ model, train_loader = torchacc.accelerate(model, train_loader, config)
device = model.device
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
for epoch in range(2):
model.train()
start_time = time.time()
for i, (inputs, labels) in enumerate(train_loader, 1):
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
if i % log_steps == 0:
iteration_time = time.time() - start_time
throughputs = batch_size * log_steps / iteration_time
print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
start_time = time.time()
The shell command for running FSDP tasks is the same as data parallelism:
$ torchrun --nproc_per_node=4 resnet_acc.py