# DeepSpeed
DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the Zero Redundancy Optimizer (ZeRO), which enables training large models at scale. ZeRO works in several stages:

* ZeRO-1, optimizer state partitioning across GPUs
* ZeRO-2, gradient partitioning across GPUs
* ZeRO-3, parameter partitioning across GPUs

In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU to fit and train really large models on a single GPU. DeepSpeed is integrated with the Transformers [Trainer] class for all ZeRO stages and offloading. All you need to do is provide a config file, or you can use a provided template. For inference, Transformers supports ZeRO-3 and offloading since it allows loading huge models.

This guide will walk you through how to deploy DeepSpeed training, the features you can enable, how to set up the config files for different ZeRO stages, offloading, inference, and using DeepSpeed without the [Trainer].
## Installation
DeepSpeed is available to install from PyPI or Transformers (for more detailed installation options, take a look at the DeepSpeed installation details or the GitHub README).

If you're having difficulties installing DeepSpeed, check the DeepSpeed CUDA installation guide. While DeepSpeed has a pip installable PyPI package, it is highly recommended to install it from source to best match your hardware and to support certain features, like 1-bit Adam, which aren't available in the PyPI distribution.
```bash
pip install deepspeed
```

```bash
pip install transformers[deepspeed]
```
## Memory requirements
Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements for the bigscience/T0_3B model on a single GPU:
```bash
$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("bigscience/T0_3B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
[...]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 2783M total params, 65M largest layer params.
  per CPU  |  per GPU |   Options
  70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   0.37GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=1
  15.56GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=0
```
This means you either need a single 80GB GPU without CPU offload or an 8GB GPU and ~60GB of CPU memory to offload to (these are just the memory requirements for the parameters, optimizer states, and gradients, and you'll need a bit more for the CUDA kernels and activations). You should also consider the tradeoff between cost and speed, because it'll be cheaper to rent or buy a smaller GPU but it'll take longer to train your model.

If you have enough GPU memory, make sure you disable CPU/NVMe offload to make everything faster.
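If you prefer to run the estimator from a Python session instead of the one-liner above, a minimal sketch (reusing the same bigscience/T0_3B checkpoint and the same estimator function, here with a hypothetical 2-GPU node) looks like this:

```py
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Same estimator as above, but for a hypothetical single node with 2 GPUs,
# to see how the per-GPU requirements shrink as ZeRO-3 shards the model states.
model = AutoModel.from_pretrained("bigscience/T0_3B")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)
```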
## Select a ZeRO stage

After you've installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. In order of fastest and most memory-efficient:
| Fastest          | Memory efficient |
|------------------|------------------|
| ZeRO-1           | ZeRO-3 + offload |
| ZeRO-2           | ZeRO-3           |
| ZeRO-2 + offload | ZeRO-2 + offload |
| ZeRO-3           | ZeRO-2           |
| ZeRO-3 + offload | ZeRO-1           |
To find what works best for you, start with the fastest approach, and if you run out of memory, try the next stage, which is slower but more memory efficient. Feel free to work in whichever direction you prefer (starting with the most memory efficient or the fastest) to discover the appropriate balance between speed and memory usage.

A general process you can use is (start with a batch size of 1; a minimal starting-point sketch follows the list):
1. enable gradient checkpointing
2. try ZeRO-2
3. try ZeRO-2 and offload the optimizer
4. try ZeRO-3
5. try ZeRO-3 and offload parameters to the CPU
6. try ZeRO-3 and offload parameters and the optimizer to the CPU
7. try lowering various default values, like a narrower search beam if you're using the [~GenerationMixin.generate] method
8. try mixed half-precision (fp16 on older GPU architectures and bf16 on Ampere) over full-precision weights
9. add more hardware if possible or enable Infinity to offload parameters and the optimizer to an NVMe
10. once you're not running out of memory, measure the effective throughput and then try to increase the batch size as large as you can to maximize GPU efficiency
11. lastly, try to optimize your training setup by disabling some offload features or using a faster ZeRO stage and increasing/decreasing the batch size to find the best tradeoff between speed and memory usage
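As a concrete starting point for the process above, the sketch below combines a batch size of 1, gradient checkpointing, and ZeRO-2 with optimizer offload (steps 1-3). The config dict and argument values are illustrative only; adjust them as you work through the later steps.

```py
from transformers import TrainingArguments

# Illustrative ZeRO-2 + optimizer offload config (it mirrors the ZeRO-2 example
# later in this guide); "auto" values are filled in by the Trainer.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
}

args = TrainingArguments(
    output_dir="out",                  # hypothetical output directory
    per_device_train_batch_size=1,     # start with a batch size of 1
    gradient_checkpointing=True,       # step 1: enable gradient checkpointing
    deepspeed=ds_config,
)
```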
## DeepSpeed configuration file
DeepSpeed works with the [Trainer] class by way of a config file containing all the parameters for configuring how you want to set up your training run. When you execute your training script, DeepSpeed logs the configuration it received from [Trainer] to the console so you can see exactly what configuration was used.

Find a complete list of DeepSpeed configuration options in the DeepSpeed Configuration JSON reference. You can also find practical examples of various DeepSpeed configurations in the DeepSpeedExamples repository or the main DeepSpeed repository. To quickly find specific examples, you can:
```bash
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples
find . -name '*json'

# find examples with the Lamb optimizer
grep -i Lamb $(find . -name '*json')
```
The DeepSpeed configuration file is passed as a path to a JSON file if you're training from the command line interface or as a nested dict object if you're using the [Trainer] in a notebook setting.
```py
TrainingArguments(..., deepspeed="path/to/deepspeed_config.json")
```
```py
ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
args = TrainingArguments(..., deepspeed=ds_config_dict)
trainer = Trainer(model, args, ...)
```
### DeepSpeed and Trainer parameters
There are three types of configuration parameters:

1. Some configuration parameters are shared by [Trainer] and DeepSpeed, and it can be difficult to identify errors when there are conflicting definitions. To make it easier, these shared configuration parameters are configured from the [Trainer] command line arguments.
2. Some configuration parameters are automatically derived from the model configuration, so you don't need to manually adjust these values. The [Trainer] uses the configuration value auto to determine the most correct or efficient value (see the fragment after this list). You could set your own configuration parameters explicitly, but you must take care to ensure the [Trainer] arguments and the DeepSpeed configuration parameters agree. Mismatches may cause training to fail in very difficult to detect ways!
3. Some configuration parameters are specific to DeepSpeed and need to be manually set based on your training needs.
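To illustrate the first two types, here is a minimal, illustrative fragment of a config dict that leaves the shared and model-derived values to the [Trainer] (the keys shown are a small subset of a real config):

```py
# Illustrative fragment: "auto" values are resolved by the Trainer, either from its
# own arguments (shared parameters) or from the model config (derived parameters),
# so the two sources can't silently disagree.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # shared: follows per_device_train_batch_size
    "gradient_accumulation_steps": "auto",     # shared: follows the Trainer argument
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": "auto",          # derived from the model's hidden size
    },
}
```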
You could also modify the DeepSpeed configuration and edit [TrainingArguments] from it:

1. Create or load a DeepSpeed configuration to use as the main configuration.
2. Create a [TrainingArguments] object based on these DeepSpeed configuration values.

Some values, such as scheduler.params.total_num_steps, are calculated by the [Trainer] during training.
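A minimal sketch of that workflow, assuming a hypothetical ds_config.json whose train_micro_batch_size_per_gpu is set to a concrete value rather than auto, might look like this:

```py
import json

from transformers import TrainingArguments

# Load an existing DeepSpeed config and reuse one of its values when building the
# TrainingArguments, so the two stay in sync. Assumes the config stores a concrete
# batch size rather than "auto".
with open("ds_config.json") as f:  # hypothetical path
    ds_config = json.load(f)

args = TrainingArguments(
    output_dir="out",  # hypothetical output directory
    per_device_train_batch_size=ds_config["train_micro_batch_size_per_gpu"],
    deepspeed=ds_config,
)
```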
### ZeRO configuration
There are three configurations, each corresponding to a different ZeRO stage. Stage 1 is not as interesting for scalability, so this guide focuses on stages 2 and 3. The zero_optimization configuration contains all the options for what to enable and how to configure them. For a more detailed explanation of each parameter, take a look at the DeepSpeed Configuration JSON reference.

DeepSpeed doesn't validate parameter names, and any typos fall back to the parameter's default setting. You can watch the DeepSpeed engine startup log messages to see what values it is going to use.

The following configurations must be set up with DeepSpeed because the [Trainer] doesn't provide equivalent command line arguments.

ZeRO-1 shards the optimizer states across GPUs, and you can expect a tiny speed up. The ZeRO-1 config can be set up like this:
```yml
{
    "zero_optimization": {
        "stage": 1
    }
}
```
ZeRO-2 shards the optimizer and gradients across GPUs. This stage is primarily used for training since its features are not relevant to inference. Some important parameters to configure for better performance include:

* offload_optimizer should be enabled to reduce GPU memory usage.
* overlap_comm, when set to true, trades off increased GPU memory usage for lower allreduce latency. This feature uses 4.5x the allgather_bucket_size and reduce_bucket_size values. In this example, they're set to 5e8, which means it requires 9GB of GPU memory. If your GPU memory is 8GB or less, you should reduce overlap_comm to lower the memory requirements and prevent an out-of-memory (OOM) error.
* allgather_bucket_size and reduce_bucket_size trade off available GPU memory for communication speed. The smaller their values, the slower communication is and the more GPU memory is available. You can balance, for example, whether a bigger batch size is more important than a slightly slower training time.
* round_robin_gradients is available in DeepSpeed 0.4.4 for CPU offloading. It parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. The performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism).
```yml
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    }
}
```
ZeRO-3 shards the optimizer, gradients, and parameters across GPUs. Unlike ZeRO-2, ZeRO-3 can also be used for inference, in addition to training, because it allows large models to be loaded on multiple GPUs. Some important parameters to configure include:

* device: "cpu" can help if you're running out of GPU memory and you have free CPU memory available. This allows offloading model parameters to the CPU.
* pin_memory: true can improve throughput, but less memory becomes available for other processes because the pinned memory is reserved for the specific process that requested it, and it's typically accessed much faster than normal CPU memory.
* stage3_max_live_parameters is the upper limit on how many full parameters you want to keep on the GPU at any given time. Reduce this value if you encounter an OOM error.
* stage3_max_reuse_distance is a value for determining when a parameter is used again in the future, and it helps decide whether to throw the parameter away or to keep it. If the parameter is going to be reused (if the value is less than stage3_max_reuse_distance), then it is kept to reduce communication overhead. This is super helpful when activation checkpointing is enabled and you want to keep the parameter in the forward recompute until the backward pass. But reduce this value if you encounter an OOM error.
* stage3_gather_16bit_weights_on_model_save consolidates fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. You should enable it if you're planning on resuming training.
* sub_group_size controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of sub_group_size, and each bucket is updated one at a time. When used with NVMe offload, sub_group_size determines when model states are moved in and out of CPU memory during the optimization step. This prevents running out of CPU memory for extremely large models. sub_group_size can be left to its default value if you aren't using NVMe offload, but you may want to change it if you:
    * run into an OOM error during the optimizer step. In this case, reduce sub_group_size to reduce memory usage of the temporary buffers.
    * find the optimizer step is taking a really long time. In this case, increase sub_group_size to improve bandwidth utilization as a result of increased data buffers.
* reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold are dependent on a model's hidden size. It is recommended to set these values to auto and allow the [Trainer] to automatically assign the values.
```yml
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
```
You can use the deepspeed.zero.Init context manager to initialize a model faster:

```py
from transformers import T5ForConditionalGeneration, T5Config
import deepspeed

with deepspeed.zero.Init():
    config = T5Config.from_pretrained("google-t5/t5-small")
    model = T5ForConditionalGeneration(config)
```
For pretrained models, the DeepSpeed config file needs to have is_deepspeed_zero3_enabled: true set up in [TrainingArguments] and it needs a ZeRO configuration enabled. The [TrainingArguments] object must be created before calling the model's [~PreTrainedModel.from_pretrained] method.
```py
from transformers import AutoModel, Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed=ds_config)
model = AutoModel.from_pretrained("google-t5/t5-small")
trainer = Trainer(model=model, args=training_args, ...)
```
You'll need ZeRO-3 if the fp16 weights don't fit on a single GPU. If you're able to load fp16 weights, then make sure you specify torch_dtype=torch.float16 in [~PreTrainedModel.from_pretrained].

Another consideration for ZeRO-3 is that if you have multiple GPUs, no single GPU has all the parameters unless they're the parameters of the currently executing layer. To access all parameters from all the layers at once, such as when loading pretrained model weights in [~PreTrainedModel.from_pretrained], one layer is loaded at a time and immediately partitioned across all GPUs. This is because for very large models, it isn't possible to load the weights on one GPU and then distribute them across the other GPUs due to memory limitations.

If you encounter a model parameter weight that looks like the following, where tensor([1.]) appears or the parameter size is 1 instead of a larger multi-dimensional shape, this means the parameter is partitioned and this is a ZeRO-3 placeholder.

```py
tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)
```

For more information about initializing large models with ZeRO-3 and accessing the parameters, take a look at the Constructing Massive Models and Gathering Parameters guides.
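If you need to materialize a partitioned parameter yourself, for example to inspect its values, DeepSpeed provides a deepspeed.zero.GatheredParameters context manager. A minimal sketch, assuming model is a ZeRO-3 partitioned model that has already been set up, could look like this:

```py
import deepspeed

# Assumes `model` is already partitioned with ZeRO-3.
param = next(model.parameters())
print(param.shape)  # placeholder shape while the parameter is partitioned

# Temporarily gather the full parameter on this rank, then re-partition on exit.
with deepspeed.zero.GatheredParameters(param):
    print(param.shape)  # full shape while inside the context
```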
### NVMe configuration
ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms allow each GPU to send and receive very small amounts of data during offloading such that a modern NVMe can fit an even larger total memory pool than is available to your training process. ZeRO-Infinity requires ZeRO-3.

Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and parameters, just one of them, or none. You should also make sure the nvme_path is pointing to an NVMe device, because while it still works with a normal hard drive or solid state drive, it'll be significantly slower. With a modern NVMe, you can expect peak transfer speeds of ~3.5GB/s for read and ~3GB/s for write operations. Lastly, run a benchmark on your training setup to determine the optimal aio configuration.
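For orientation before the full example, an illustrative fragment of the offloading portion of a ZeRO-Infinity config, expressed as a Python dict, might look like this (the nvme_path values are placeholders for your own mount point):

```py
# Illustrative fragment only - the fuller ZeRO-3/Infinity example follows below.
zero_infinity_offload = {
    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",  # placeholder: point this at your NVMe mount
        "pin_memory": True,
    },
    "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",  # placeholder
        "pin_memory": True,
    },
}
```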
The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto, but you could also manually add these values.
```yml
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },