Quick Start

This guide uses ResNet-50 as an example to demonstrate how to accelerate training with TorchAcc.

Torch Native Task

Below is the training code for a ResNet-50 in native Torch:

import time
import torch
import torchvision

batch_size = 64
log_steps = 20
inputs = torch.randn(6400, 3, 224, 224)
labels = torch.randint(0, 100, (6400,))
dataset = torch.utils.data.TensorDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=True, num_workers=4)

model = torchvision.models.resnet50()
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 100)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

for epoch in range(2):
    model.train()
    start_time = time.time()
    for i, (inputs, labels) in enumerate(train_loader, 1):
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        if i % log_steps == 0:
            iteration_time = time.time() - start_time
            throughputs = batch_size * log_steps / iteration_time
            print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
            start_time = time.time()

The results on an 80 GB A100 are as follows:

$ python resnet_native.py

Epoch: 0, Step: 20, Loss: 5.0722, Throughputs: 146.9960 samples/s
Epoch: 0, Step: 40, Loss: 5.6211, Throughputs: 933.5072 samples/s
Epoch: 0, Step: 60, Loss: 5.0938, Throughputs: 933.4178 samples/s
Epoch: 0, Step: 80, Loss: 4.8861, Throughputs: 931.6142 samples/s
Epoch: 0, Step: 100, Loss: 4.8116, Throughputs: 927.7186 samples/s
Epoch: 1, Step: 20, Loss: 4.6499, Throughputs: 777.6132 samples/s
Epoch: 1, Step: 40, Loss: 4.7558, Throughputs: 929.3011 samples/s
Epoch: 1, Step: 60, Loss: 4.6438, Throughputs: 923.1462 samples/s
Epoch: 1, Step: 80, Loss: 4.6413, Throughputs: 933.8570 samples/s
Epoch: 1, Step: 100, Loss: 4.7834, Throughputs: 934.5286 samples/s
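Each log line reports samples per second over the last log_steps iterations, that is batch_size * log_steps / elapsed time. As a minimal sketch, the same bookkeeping can be factored into a small helper. The names ThroughputMeter, reset, step and clock are hypothetical and are not part of Torch or TorchAcc:

```python
import time


class ThroughputMeter:
    """Reports samples/s over fixed windows of `log_steps` iterations.

    Hypothetical helper for illustration; it mirrors the arithmetic used
    in the training loop above.
    """

    def __init__(self, batch_size, log_steps, clock=time.perf_counter):
        self.batch_size = batch_size
        self.log_steps = log_steps
        self.clock = clock              # injectable so the meter can be tested
        self.window_start = clock()

    def reset(self):
        """Restart the current window, e.g. at the start of an epoch."""
        self.window_start = self.clock()

    def step(self, i):
        """Return samples/s if step `i` closes a window, otherwise None."""
        if i % self.log_steps != 0:
            return None
        now = self.clock()
        elapsed = now - self.window_start
        self.window_start = now         # the next window starts here
        return self.batch_size * self.log_steps / elapsed
```

In the loop, calling meter.reset() at the start of each epoch and meter.step(i) after optimizer.step() would replace the manual start_time handling.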

Single GPU Acceleration with `TorchAcc`

By modifying 3 lines of code, you can call TorchAcc’s accelerate interface to accelerate model training:

  import time
  import torch
+ import torchacc
  import torchvision

  batch_size = 64
  log_steps = 20
  inputs = torch.randn(6400, 3, 224, 224)
  labels = torch.randint(0, 100, (6400,))
  dataset = torch.utils.data.TensorDataset(inputs, labels)
  train_loader = torch.utils.data.DataLoader(
      dataset, batch_size=batch_size, shuffle=True, num_workers=4)

  model = torchvision.models.resnet50()
  num_ftrs = model.fc.in_features
  model.fc = torch.nn.Linear(num_ftrs, 100)

- device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
- model = model.to(device)
+ model, train_loader = torchacc.accelerate(model, train_loader)
+ device = model.device
  criterion = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

  for epoch in range(2):
      model.train()
      start_time = time.time()
      for i, (inputs, labels) in enumerate(train_loader, 1):
-         inputs, labels = inputs.to(device), labels.to(device)
          optimizer.zero_grad()
          outputs = model(inputs)
          loss = criterion(outputs, labels)

          loss.backward()
          optimizer.step()
          if i % log_steps == 0:
              iteration_time = time.time() - start_time
              throughputs = batch_size * log_steps / iteration_time
              print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
              start_time = time.time()

The main changes are:

Remove the original model-to-device logic and wrap the model and dataloader with torchacc.accelerate.
Remove the batch-to-device transfer, since the dataloader wrapper handles it automatically.
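To make the second point concrete, here is a rough sketch of what a dataloader wrapper that moves batches to a device can look like. It is illustrative only and is not TorchAcc's implementation; the class name DeviceLoader is hypothetical, and each batch is assumed to be a tuple of objects with a .to(device) method, such as torch.Tensor:

```python
class DeviceLoader:
    """Wraps an iterable of batches and moves every element to `device`.

    Illustrative sketch only; this is not how TorchAcc is implemented.
    """

    def __init__(self, loader, device):
        self.loader = loader
        self.device = device

    def __len__(self):
        return len(self.loader)

    def __iter__(self):
        for batch in self.loader:
            # Assumption: each batch is a tuple/list of tensor-like objects
            # that expose `.to(device)`.
            yield tuple(item.to(self.device) for item in batch)
```

With such a wrapper, the loop body receives inputs and labels that are already on the target device, which is why the explicit inputs.to(device), labels.to(device) line can be dropped.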

The results with TorchAcc enabled are as follows:

$ python resnet_acc.py

Epoch: 0, Step: 20, Loss: 5.3560, Throughputs: 38.8063 samples/s
Epoch: 0, Step: 40, Loss: 4.8361, Throughputs: 1144.4567 samples/s
Epoch: 0, Step: 60, Loss: 4.8088, Throughputs: 1141.0203 samples/s
Epoch: 0, Step: 80, Loss: 4.6511, Throughputs: 1131.4401 samples/s
Epoch: 0, Step: 100, Loss: 4.6082, Throughputs: 1191.8075 samples/s
Epoch: 1, Step: 20, Loss: 4.6110, Throughputs: 722.4752 samples/s
Epoch: 1, Step: 40, Loss: 4.6275, Throughputs: 1150.5421 samples/s
Epoch: 1, Step: 60, Loss: 4.6875, Throughputs: 1163.3742 samples/s
Epoch: 1, Step: 80, Loss: 4.6159, Throughputs: 1176.7777 samples/s
Epoch: 1, Step: 100, Loss: 4.6067, Throughputs: 1184.0109 samples/s

The first logging interval is slow because the model undergoes compilation optimization at the start of training. After compilation completes, the steady-state throughput in the logs above (steps 40 to 100 of each epoch) averages about 1160 samples/s with TorchAcc versus about 931 samples/s for native Torch, an improvement of roughly 25%.
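The comparison can be reproduced from the two logs above. The sketch below averages the intervals from step 40 onward in both epochs, which is one assumed reading of "after step 40"; a different choice of intervals gives a different figure. The helper name steady_state_mean is hypothetical:

```python
import re

NATIVE_LOG = """\
Epoch: 0, Step: 20, Loss: 5.0722, Throughputs: 146.9960 samples/s
Epoch: 0, Step: 40, Loss: 5.6211, Throughputs: 933.5072 samples/s
Epoch: 0, Step: 60, Loss: 5.0938, Throughputs: 933.4178 samples/s
Epoch: 0, Step: 80, Loss: 4.8861, Throughputs: 931.6142 samples/s
Epoch: 0, Step: 100, Loss: 4.8116, Throughputs: 927.7186 samples/s
Epoch: 1, Step: 20, Loss: 4.6499, Throughputs: 777.6132 samples/s
Epoch: 1, Step: 40, Loss: 4.7558, Throughputs: 929.3011 samples/s
Epoch: 1, Step: 60, Loss: 4.6438, Throughputs: 923.1462 samples/s
Epoch: 1, Step: 80, Loss: 4.6413, Throughputs: 933.8570 samples/s
Epoch: 1, Step: 100, Loss: 4.7834, Throughputs: 934.5286 samples/s
"""

ACC_LOG = """\
Epoch: 0, Step: 20, Loss: 5.3560, Throughputs: 38.8063 samples/s
Epoch: 0, Step: 40, Loss: 4.8361, Throughputs: 1144.4567 samples/s
Epoch: 0, Step: 60, Loss: 4.8088, Throughputs: 1141.0203 samples/s
Epoch: 0, Step: 80, Loss: 4.6511, Throughputs: 1131.4401 samples/s
Epoch: 0, Step: 100, Loss: 4.6082, Throughputs: 1191.8075 samples/s
Epoch: 1, Step: 20, Loss: 4.6110, Throughputs: 722.4752 samples/s
Epoch: 1, Step: 40, Loss: 4.6275, Throughputs: 1150.5421 samples/s
Epoch: 1, Step: 60, Loss: 4.6875, Throughputs: 1163.3742 samples/s
Epoch: 1, Step: 80, Loss: 4.6159, Throughputs: 1176.7777 samples/s
Epoch: 1, Step: 100, Loss: 4.6067, Throughputs: 1184.0109 samples/s
"""

# Captures the step number and the throughput value of one log line.
LINE = re.compile(r"Step: (\d+), .*Throughputs: ([\d.]+) samples/s")


def steady_state_mean(log, first_step=40):
    """Mean throughput over log lines whose step is >= first_step."""
    values = [float(m.group(2)) for m in LINE.finditer(log)
              if int(m.group(1)) >= first_step]
    return sum(values) / len(values)


native = steady_state_mean(NATIVE_LOG)
acc = steady_state_mean(ACC_LOG)
improvement = 100 * (acc / native - 1)
print(f"native: {native:.1f} samples/s, TorchAcc: {acc:.1f} samples/s, "
      f"improvement: {improvement:.1f}%")
```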

Multi-GPU Acceleration with `TorchAcc`

Data Parallel

No code changes are required: launch the same script with torchrun. TorchAcc automatically detects the number of GPUs and defaults to data parallel training.

To keep the output readable and report the global rate, print only from rank 0 and multiply the per-process throughput by the number of GPUs. The code can be modified as follows:

  import time
  import torch
  import torchacc
  import torchvision

  batch_size = 64
  log_steps = 20
  inputs = torch.randn(6400, 3, 224, 224)
  labels = torch.randint(0, 100, (6400,))
  dataset = torch.utils.data.TensorDataset(inputs, labels)
  train_loader = torch.utils.data.DataLoader(
      dataset, batch_size=batch_size, shuffle=True, num_workers=4)

  model = torchvision.models.resnet50()
  num_ftrs = model.fc.in_features
  model.fc = torch.nn.Linear(num_ftrs, 100)

  model, train_loader = torchacc.accelerate(model, train_loader)
  device = model.device
  criterion = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

  for epoch in range(2):
      model.train()
      start_time = time.time()
      for i, (inputs, labels) in enumerate(train_loader, 1):
          optimizer.zero_grad()
          outputs = model(inputs)
          loss = criterion(outputs, labels)

          loss.backward()
          optimizer.step()
-         if i % log_steps == 0:
+         if i % log_steps == 0 and torch.distributed.get_rank() == 0:
              iteration_time = time.time() - start_time
-             throughputs = batch_size * log_steps / iteration_time
+             throughputs = batch_size * log_steps / iteration_time * torch.distributed.get_world_size()
              print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
              start_time = time.time()

$ torchrun --nproc_per_node=4 resnet_acc.py

Epoch: 0, Step: 20, Loss: 4.8700, Throughputs: 311.2281 samples/s
Epoch: 0, Step: 40, Loss: 4.7842, Throughputs: 4359.9072 samples/s
Epoch: 0, Step: 60, Loss: 4.6687, Throughputs: 4370.7357 samples/s
Epoch: 0, Step: 80, Loss: 4.6690, Throughputs: 4385.6592 samples/s
Epoch: 0, Step: 100, Loss: 4.7295, Throughputs: 4540.9381 samples/s
Epoch: 1, Step: 20, Loss: 4.6978, Throughputs: 2852.0285 samples/s
Epoch: 1, Step: 40, Loss: 4.6760, Throughputs: 4378.8057 samples/s
Epoch: 1, Step: 60, Loss: 4.6696, Throughputs: 3888.6148 samples/s
Epoch: 1, Step: 80, Loss: 4.6757, Throughputs: 4347.2658 samples/s
Epoch: 1, Step: 100, Loss: 4.6497, Throughputs: 4421.5528 samples/s
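The rank check and the scaling by world size can also be written against the RANK and WORLD_SIZE environment variables that torchrun exports for each process. The sketch below is one possible variant, not what the script above does; it falls back to a single process when those variables are absent, so the same logging code also works when the script is started with plain python. The helper names dist_info and global_throughput are hypothetical:

```python
import os


def dist_info(env=os.environ):
    """Read (rank, world_size) from the variables that torchrun exports.

    Falls back to (0, 1) when they are missing, i.e. a single-process run.
    """
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, world_size


def global_throughput(batch_size, log_steps, elapsed, world_size):
    """Per-process samples/s multiplied by the number of processes."""
    return batch_size * log_steps / elapsed * world_size


rank, world_size = dist_info()
if rank == 0:
    # Example: 20 steps of batch size 64 that took 2 seconds on each process.
    print(f"{global_throughput(64, 20, 2.0, world_size):.1f} samples/s")
```

Multiplying by the world size assumes that every process handles the same batch size at roughly the same speed, which is the same assumption the modified script makes.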

FSDP (Fully Sharded Data Parallel)

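To enable FSDP training, create a torchacc.Config, set its FSDP options, and pass it to torchacc.accelerate. In the example below, config.dist.fsdp.size is set to 4, the same as the number of processes launched by torchrun, and config.dist.fsdp.wrap_layer_cls names Bottleneck, the residual block class of torchvision's ResNet-50, as the layer class to wrap.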

  import time
  import torch
  import torchacc
  import torchvision

  batch_size = 64
  log_steps = 20
  inputs = torch.randn(6400, 3, 224, 224)
  labels = torch.randint(0, 100, (6400,))
  dataset = torch.utils.data.TensorDataset(inputs, labels)
  train_loader = torch.utils.data.DataLoader(
      dataset, batch_size=batch_size, shuffle=True, num_workers=4)

  model = torchvision.models.resnet50()
  num_ftrs = model.fc.in_features
  model.fc = torch.nn.Linear(num_ftrs, 100)

- model, train_loader = torchacc.accelerate(model, train_loader)
+ config = torchacc.Config()
+ config.dist.fsdp.size = 4
+ config.dist.fsdp.wrap_layer_cls = {"Bottleneck"}
+ model, train_loader = torchacc.accelerate(model, train_loader, config)
  device = model.device

  criterion = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

  for epoch in range(2):
      model.train()
      start_time = time.time()
      for i, (inputs, labels) in enumerate(train_loader, 1):
          optimizer.zero_grad()
          outputs = model(inputs)
          loss = criterion(outputs, labels)
          loss.backward()
          optimizer.step()

          if i % log_steps == 0:
              iteration_time = time.time() - start_time
              throughputs = batch_size * log_steps / iteration_time
              print(f'Epoch: {epoch}, Step: {i}, Loss: {loss:.4f}, Throughputs: {throughputs:.4f} samples/s')
              start_time = time.time()

The shell command for running the FSDP task is the same as for data parallelism:

$ torchrun --nproc_per_node=4 resnet_acc.py
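For intuition about what "fully sharded" means, the sketch below shows the basic idea of splitting a flattened parameter evenly across ranks so that each process stores only its own slice. It is a conceptual illustration under simplifying assumptions, not TorchAcc's code; real FSDP implementations flatten parameters per wrapped unit and handle padding in their own way. The function name shard_bounds is hypothetical:

```python
def shard_bounds(numel, world_size, rank):
    """Return (start, end, shard_size) of `rank`'s slice of a flat parameter.

    `shard_size` is the padded, equal size of every shard; the last ranks
    may own fewer real elements when `numel` is not divisible.
    """
    shard_size = -(-numel // world_size)    # ceiling division
    start = min(rank * shard_size, numel)
    end = min(start + shard_size, numel)
    return start, end, shard_size


# Example: the replaced classifier head Linear(2048, 100) has
# 2048 * 100 weights + 100 biases = 204900 parameters.
numel = 2048 * 100 + 100
for rank in range(4):
    start, end, shard_size = shard_bounds(numel, 4, rank)
    print(f"rank {rank}: elements [{start}, {end}) of {numel}")
```

With four ranks, each process holds a quarter of these parameters and gathers the rest only when the layer is actually computed, which is the memory saving FSDP trades for extra communication.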