# Introduction

TorchAcc is a PyTorch distributed training acceleration framework provided by Alibaba Cloud's PAI platform.

TorchAcc builds on [PyTorch/XLA](https://github.com/pytorch/xla) to provide training acceleration, and adds a considerable amount of targeted optimization for GPUs. Together, these give TorchAcc better usability, higher performance, and richer functionality.

## Main Features

The key features of TorchAcc:

* Rich distributed parallelism
    * Data Parallelism
    * Fully Sharded Data Parallelism
    * Tensor Parallelism
    * Pipeline Parallelism
    * [Ulysses](https://arxiv.org/abs/2309.14509)
    * [Ring Attention](https://arxiv.org/abs/2310.01889)
    * Flash Sequence (Solution for Long Sequence)
* Low Memory Cost
* High Performance
* Ease of use


## Model Performance

The charts below summarize the throughput improvements for some common models after integrating with TorchAcc.

![llama](resources/llama_throughput.png)

![cv](resources/cv_throughput.png)

![gpt2](resources/gpt2_throughput.png)

Notes:
* Swin, DeiT, and ConvLSTM were tested with 8 x 80 GB A100 GPUs per machine and an inter-machine bandwidth of 800 Gbps. The performance data shown is for a single card; scalability across multiple machines is nearly linear.
* GPT-2 was tested with 8 x 16 GB V100 GPUs per machine and an inter-machine bandwidth of 30 Gbps.
* LLAMA-7B was tested on 2 machines, each with 8 x 80 GB A100 GPUs, and an inter-machine bandwidth of 800 Gbps.
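
The notes above report single-card throughput and describe multi-machine scaling as nearly linear. As a minimal sketch of how to check that kind of claim against your own measurements, the helper below divides measured aggregate throughput by the ideal linear throughput. The function name `scaling_efficiency` and the sample numbers are hypothetical; they are not part of TorchAcc and are not taken from the benchmarks above.

```python
def scaling_efficiency(single_card_throughput: float,
                       total_throughput: float,
                       num_cards: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling.

    A value of 1.0 means perfectly linear scaling; values slightly
    below 1.0 correspond to "nearly linear" scaling.
    """
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")
    if single_card_throughput <= 0:
        raise ValueError("single_card_throughput must be positive")
    # Ideal case: every card sustains the single-card throughput.
    ideal_throughput = single_card_throughput * num_cards
    return total_throughput / ideal_throughput


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (samples/s).
    efficiency = scaling_efficiency(
        single_card_throughput=100.0,
        total_throughput=1520.0,
        num_cards=16,
    )
    print(f"scaling efficiency: {efficiency:.2%}")
```

Use the same batch size per card and the same units (for example, samples or tokens per second) for both measurements, otherwise the ratio is not meaningful.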