# HuggingFace Transformers

This is a best Practices for Accelerating Training with HuggingFace Transformers. TorchAcc supports acceleration of native HuggingFace Transformers. For users using the HF `Trainer` interface, it is very convenient to accelerate Transformers model training through TorchAcc.

The following will use the `run_clm.py` example script from the Transformers library to train `Llama3-8B` tasks as examples, demonstrating how to use `TorchAcc` to accelerate Transformers training. We will also compare the methods of training Transformers models using `native PyTorch` and `DeepSpeed` with the `run_clm.py` script.


## Environment Preparation

### Start a container

Refer to the `"install"` section to obtain the latest image:

```bash
sudo docker run --gpus all --net host --ipc host --shm-size 10G -it --rm --cap-add=SYS_PTRACE registry.cn-wulanchabu.aliyuncs.com/pai-dlc/acc:r2.3.0-cuda12.1.0-py3.10-nightly bash
```

### Environment Configuration

Since we are running the `run_clm.py` script built into Transformers, it requires source code installation of the Transformers library:

```bash
# Uninstall Transformers already installed in the image
pip uninstall transformers -y

# Clone and install Transformers
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .

# Install related dependencies
pip install evaluate scikit-learn

# If DeepSpeed training is needed
pip install deepspeed
```

### Preparation

> If your network cannot access HuggingFace, please refer to the following instructions to manually download the model. Otherwise, you can skip the `Model Preparation` and `Dataset Preparation` sections.

#### Model Preparation

You can download the `Meta-Llama-3-8B` model from the model repositories of `HuggingFace` or `ModelScope`. Here's an example using `ModelScope`, with the model repository link: [https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B/files).

You can clone the model configuration and weights to your local machine using `git clone`, and store them in a specified directory (such as Meta-Llama-3-8B):

```bash
apt-get update && apt-get install git-lfs
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B.git
```

### Dataset Preparation

We use the wikitext dataset for training the model. For detailed information about the dataset, visit: [https://www.modelscope.cn/datasets/modelscope/wikitext/](https://www.modelscope.cn/datasets/modelscope/wikitext/)

You can download the dataset and place it in the specified directory.

### Enabling FlashAttention

In the `run_clm.py` file, find the `AutoModelForCausalLM.from_pretrained()` and `AutoModelForCausalLM.from_config()`, and add `attn_implementation="flash_attention_2"` to enable FlashAttention computation.

```diff
diff --git a/examples/pytorch/language-modeling/run_clm.py b/examples/pytorch/language-modeling/run_clm.py
index c0db57037..dc8e3040a 100755
--- a/examples/pytorch/language-modeling/run_clm.py
+++ b/examples/pytorch/language-modeling/run_clm.py
@@ -434,9 +436,10 @@ def main():
                trust_remote_code=model_args.trust_remote_code,
                torch_dtype=torch_dtype,
                low_cpu_mem_usage=model_args.low_cpu_mem_usage,
+               attn_implementation='flash_attention_2'
            )
        else:
-        model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code)
+        model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code, attn_implementation='flash_attention_2')
            n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values())
            logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params")
```


## PyTorch Native Training

You can follow these steps to conduct Transformers Llama3-8B PyTorch native FSDP training.

### Configure FSDP config file

When running FSDP training tasks using `run_clm.py`, you need to configure this file:

**Note: Due to a bug in Transformers, activation_checkpoint cannot be enabled for this task.**

```json
{
    "fsdp_transformer_layer_cls_to_wrap": [
        "LlamaDecoderLayer"
    ],
    "activation_checkpointing": false
}
```

### Training command

We utilize the `run_clm.py` file from the Transformers library to run directly by specifying parameters without needing additional code. The `run_clm.py` file encapsulates the logic of the Transformers library's Trainer and provides various parameter configurations. For specific parameter information, you can run `python run_clm.py --help`.

Use `torchrun` to start the training. The specific command is as follows:


```bash
set -ex

echo "Running a native torch job ..."

export USE_TORCH_XLA=0

[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010

BS=1
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_native.json"
JOB_NAME="LLAMA3_FSDP_NATIVE_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"


torchrun --nproc_per_node $NPROC_PER_NODE \
    --nnodes $WORLD_SIZE \
    --node_rank $RANK \
    --master_port $MASTER_PORT \
    --master_addr $MASTER_ADDR \
    ../examples/pytorch/language-modeling/run_clm.py \
    --num_train_epochs 2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --use_fast_tokenizer false \
    --per_device_train_batch_size $BS \
    --per_device_eval_batch_size $BS \
    --do_train \
    --output_dir /tmp/test-clm \
    --overwrite_output_dir \
    --config_name ./Meta-Llama-3-8B/ \
    --tokenizer_name ./Meta-Llama-3-8B/ \
    --trust_remote_code true \
    --cache_dir ./cache \
    --block_size $SEQLEN \
    --optim adamw_torch \
    --save_strategy no \
    --logging_strategy steps \
    --gradient_checkpointing no \
    --logging_steps 100 \
    --$PRECISION \
    --fsdp "auto_wrap" \
    --fsdp_config $FSDP_CONFIG 2>&1 | tee ./$JOB_NAME.log
```


## DeepSpeed Training

### Configuring DeepSpeed Config

Specific configuration details can be found in the official documentation: [https://www.deepspeed.ai/docs/config-json/](https://www.deepspeed.ai/docs/config-json/). To align with the Transformers native and torchacc's FSDP training tasks, we configure the DeepSpeed file as follows:

* The zero3 training strategy is the same as FSDP.
* train_batch_size = train_micro_batch_size_per_gpu * number of training cards, should align with `batch_size` in the training script.

> The Transformers library does not recognize activation checkpointing configured in DeepSpeed config, so activation checkpointing configuration is unnecessary.

```json
{
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {
        "type": "AdamW"
    },
    "zero_optimization": {
        "stage": 3
    },
    "bf16": {
        "enabled": true
    }
}
```

### Training Script


```bash
set -ex

echo "Running a deepspeed job ..."

export USE_TORCH_XLA=0

[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010

BS=1
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_ds.json"
JOB_NAME="LLAMA3_FSDP_DEEPSPEED_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"


torchrun --nproc_per_node $NPROC_PER_NODE \
    --nnodes $WORLD_SIZE \
    --node_rank $RANK \
    --master_port $MASTER_PORT \
    --master_addr $MASTER_ADDR \
    ./examples/pytorch/language-modeling/run_clm.py \
    --num_train_epochs 2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --use_fast_tokenizer false \
    --per_device_train_batch_size $BS \
    --per_device_eval_batch_size $BS \
    --do_train \
    --output_dir /tmp/test-clm \
    --overwrite_output_dir \
    --config_name ./Meta-Llama-3-8B/ \
    --tokenizer_name ./Meta-Llama-3-8B/ \
    --trust_remote_code true \
    --cache_dir ./cache \
    --block_size $SEQLEN \
    --optim adamw_torch \
    --save_strategy no \
    --logging_strategy steps \
    --gradient_checkpointing no \
    --logging_steps 100 \
    --$PRECISION \
    --deepspeed $FSDP_CONFIG 2>&1 | tee ./$JOB_NAME.log
```

## TorchAcc Training

If you want to accelerate the training of `Transformers llama3-8b` with Torchacc, you need to make the following changes:

Open the `examples/pytorch/language-modeling/run_clm.py` file and insert the following code at the very top:

```python
import torchacc
torchacc.utils.patch.patch_llama(True)
```

### Configure TorchAcc FSDP Config

You can control the `xla_fsdp_grad_ckpt` parameter to enable or disable gradient checkpointing.

```json
{
    "fsdp_transformer_layer_cls_to_wrap": [
        "LlamaDecoderLayer"
    ],
    "xla": true,
    "xla_fsdp_settings": {
        "compute_dtype": "bfloat16",
        "buffer_dtype": "bfloat16",
        "opt_flatten_overlap": true,
        "pin_layout_in_collective_ops": false,
        "flatten_parameters": true
    },
    "xla_fsdp_grad_ckpt": false
}
```

### Training Script


```bash
set -ex

echo "Running a torch job with torchacc ..."

export PJRT_ALLOCATOR_FRACTION=0.97
export PJRT_DEVICE=CUDA
#export XLA_PERSISTENT_CACHE_PATH=./compiled_cache # uncomment this line to cache the compile results and speed up initialization.

[ -z "$RANK" ] && RANK=0
[ -z "$WORLD_SIZE" ] && WORLD_SIZE=1
[ -z "$MASTER_ADDR" ] && MASTER_ADDR=127.0.0.1
[ -z "$MASTER_PORT" ] && MASTER_PORT=9010

BS=3
SEQLEN=4096
NPROC_PER_NODE=8
PRECISION="bf16=true"
FSDP_CONFIG="llama3_fsdp_acc.json"
JOB_NAME="LLAMA3_FSDP_TORCHACC_GPU${NPROC_PER_NODE}_BS${BS}_SEQLEN${SEQLEN}_BF16_FA"


torchrun --nproc_per_node $NPROC_PER_NODE \
    --nnodes $WORLD_SIZE \
    --node_rank $RANK \
    --master_port $MASTER_PORT \
    --master_addr $MASTER_ADDR \
    ./examples/pytorch/language-modeling/run_clm.py \
    --num_train_epochs 2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --use_fast_tokenizer false \
    --per_device_train_batch_size $BS \
    --per_device_eval_batch_size $BS \
    --do_train \
    --output_dir /tmp/test-clm \
    --overwrite_output_dir \
    --config_name ./Meta-Llama-3-8B/ \
    --tokenizer_name ./Meta-Llama-3-8B/ \
    --trust_remote_code true \
    --cache_dir ./cache \
    --block_size $SEQLEN \
    --optim adamw_torch \
    --save_strategy no \
    --logging_strategy steps \
    --gradient_checkpointing no \
    --logging_steps 100 \
    --$PRECISION \
    --fsdp "auto_wrap" \
    --fsdp_config $FSDP_CONFIG 2>&1 | tee ./$JOB_NAME.log
```

## Performance

The following is a comparison of various configurations tested for Torch native, DeepSpeed, and TorchAcc, aiming to select the optimal performance configuration for each framework.

Experimental Parameters:

1. `flash_attn==2.5.6`
2. Sequence Length = 4096
3. Compute Resources: 8 * 80G A100
4. transformers commit hash: [f91c16d270e5e3ff32fdb32ccf286d05c03dfa66](https://github.com/huggingface/transformers/tree/f91c16d270e5e3ff32fdb32ccf286d05c03dfa66)


Here is the table with the last two rows removed:

| Global Batch Size | PyTorch | DeepSpeed | TorchAcc |
| --- | --- | --- | --- |
| 8 | 2945.0 tokens/s/GPU | 3123.2 tokens/s/GPU | 3276.8 tokens/s/GPU |
| 16 | OOM | OOM | 3737.6 token/s/GPU |
| 24 | OOM | OOM | 4044.8 tokens/s/GPU |

- Optimal PyTorch Configuration: BS=8+FA+noGC, Throughput: 2945.0 tokens/perGPU/s
- Optimal DeepSpeed Configuration: BS=8+FA+noGC, Throughput: 3123.2 tokens/perGPU/s, showing a **6%** improvement over
PyTorch's optimal performance.
- Optimal TorchAcc Configuration: BS=24+FA+noGC, Throughput: 4044.8 tokens/perGPU/s, showing a **37%** improvement over PyTorch's optimal performance and a **30%** improvement over DeepSpeed's optimal performance.