"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
},
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
DeepSpeed features
There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section.
Activation/gradient checkpointing
Activation and gradient checkpointing trades speed for more GPU memory, letting you overcome scenarios where your GPU runs out of memory or increase your batch size for better performance. To enable this feature:
For a Hugging Face model, set model.gradient_checkpointing_enable() or pass --gradient_checkpointing to the [Trainer], as shown in the sketch after this list.
For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API. You could also replace the Transformers modeling code and swap torch.utils.checkpoint for the DeepSpeed API. This approach is more flexible because you can offload the forward activations to CPU memory instead of recalculating them.
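For the Hugging Face case, a minimal sketch is shown below; the model name and config file name are illustrative.

```py
# A minimal sketch of enabling gradient checkpointing for a Hugging Face model
# together with DeepSpeed; "gpt2" and "ds_config.json" are placeholder choices.
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # same effect as passing --gradient_checkpointing

training_args = TrainingArguments(
    output_dir="output_dir",
    gradient_checkpointing=True,  # alternative to calling gradient_checkpointing_enable()
    deepspeed="ds_config.json",
)
```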
Optimizer and scheduler
The DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you don't enable offload_optimizer. When offload_optimizer is enabled, you can use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.
Set the optimizer and scheduler parameters for the config file from the command line to avoid hard-to-find errors. For example, if the learning rate is set to a different value in another place, you can override it from the command line. Aside from the optimizer and scheduler parameters, you'll need to ensure your [Trainer] command line arguments match the DeepSpeed configuration.
DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. If you don't configure the optimizer in the config, the [Trainer] automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: lr, adam_beta1, adam_beta2, adam_epsilon, weight_decay.
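As a rough sketch of how those values line up with the "auto" fields in the optimizer config below; the numbers here are hypothetical and only for illustration.

```py
# Hypothetical [TrainingArguments] values and the optimizer "auto" fields they fill
# when no optimizer is configured in the DeepSpeed config.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    learning_rate=2e-5,   # -> "lr"
    adam_beta1=0.9,       # -> "betas"[0]
    adam_beta2=0.999,     # -> "betas"[1]
    adam_epsilon=1e-8,    # -> "eps"
    weight_decay=0.01,    # -> "weight_decay"
    deepspeed="ds_config.json",
)
```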
You can set the parameters to "auto" or manually input your own desired values.
```yaml
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
}
}
```
You can also use an unsupported optimizer by adding the following to the top level configuration.
```yaml
{
"zero_allow_untested_optimizer": true
}
```
From DeepSpeed==0.8.3 on, if you want to use offload, you'll also need to add the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer.
```yaml
{
"zero_force_ds_cpu_optimizer": false
}
```
DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers.
Transformers and DeepSpeed provide two of the same schedulers:
WarmupLR is the same as --lr_scheduler_type constant_with_warmup in Transformers
WarmupDecayLR is the same as --lr_scheduler_type linear in Transformers (this is the default scheduler used in Transformers)
If you don't configure the scheduler in the config, the [Trainer] automatically selects WarmupDecayLR and either uses the supplied values or the default values for the following parameters from the command line: warmup_min_lr, warmup_max_lr, warmup_num_steps, total_num_steps (automatically calculated during run time if max_steps is not provided).
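As a rough sketch, the values below show how those command line parameters feed the scheduler "auto" fields; the numbers are hypothetical.

```py
# Hypothetical [TrainingArguments] values and the scheduler "auto" fields they roughly map to
# when no scheduler is configured in the DeepSpeed config.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    learning_rate=2e-5,  # -> "warmup_max_lr"
    warmup_steps=500,    # -> "warmup_num_steps"
    max_steps=10_000,    # -> "total_num_steps" (otherwise calculated at run time)
    deepspeed="ds_config.json",
)
```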
You can set the parameters to "auto" or manually input your own desired values.
```yaml
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}
```
Precision
DeepSpeed supports fp32, fp16, and bf16 mixed precision.
If your model doesn't work well with mixed precision, for example if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode.
```yaml
{
"fp16": {
"enabled": false
}
}
```
For Ampere GPUs and PyTorch > 1.7, PyTorch automatically switches to the more efficient tf32 format for some operations, but the results are still in fp32. You can control it from the [Trainer] by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it.
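Equivalently, a minimal sketch of setting it in Python rather than on the command line; the config file name is illustrative.

```py
# A minimal sketch of enabling tf32 through [TrainingArguments].
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    tf32=True,  # same effect as passing --tf32 on the command line
    deepspeed="ds_config.json",  # illustrative config file name
)
```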
PyTorch AMP-like fp16 mixed precision reduces memory usage and accelerates training speed. [Trainer] automatically enables or disables fp16 based on the value of args.fp16_backend, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: --fp16, --fp16_backend amp or --fp16_full_eval.
```yaml
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
```
For additional DeepSpeed fp16 training options, take a look at the FP16 Training Options reference.
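With "enabled": "auto" in the config above, a minimal sketch of switching fp16 on from Python instead of the command line might look like this; the config file name is illustrative.

```py
# A minimal sketch: with "enabled": "auto", fp16=True is what turns fp16 on.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    fp16=True,  # same effect as passing --fp16 on the command line
    deepspeed="ds_config_zero3.json",  # illustrative config file name
)
```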
To configure Apex-like fp16 mixed precision, set up the config as shown below with "auto" or your own values. [Trainer] automatically configures amp based on the values of args.fp16_backend and args.fp16_opt_level. It can also be enabled from the command line when the following arguments are passed: --fp16, --fp16_backend apex or --fp16_opt_level O1.
```yaml
{
"amp": {
"enabled": "auto",
"opt_level": "auto"
}
}
```
To use bf16, you'll need at least DeepSpeed==0.6.0. bf16 has the same dynamic range as fp32 and doesn’t require loss scaling. However, if you use gradient accumulation with bf16, gradients are accumulated in bf16 which may not be desired because this format's low precision can lead to lossy accumulation.
bf16 can be set up in the config file or enabled from the command line when the following arguments are passed: --bf16 or --bf16_full_eval.
```yaml
{
"bf16": {
"enabled": "auto"
}
}
```
Batch size
The batch size can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps.
```yaml
{
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto"
}
```
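For example, a quick sanity check of how the "auto" values resolve; the numbers below are hypothetical.

```py
# Hypothetical numbers showing how [Trainer] resolves the "auto" batch size values.
world_size = 8                    # total number of GPUs
per_device_train_batch_size = 4   # args.per_device_train_batch_size
gradient_accumulation_steps = 2   # args.gradient_accumulation_steps

train_micro_batch_size_per_gpu = per_device_train_batch_size  # 4
train_batch_size = world_size * per_device_train_batch_size * gradient_accumulation_steps  # 64
```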
Gradient accumulation
Gradient accumulation can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets it to the value of args.gradient_accumulation_steps.
```yaml
{
"gradient_accumulation_steps": "auto"
}
```
Gradient clipping
Gradient clipping can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets it to the value of args.max_grad_norm.
```yaml
{
"gradient_clipping": "auto"
}
```
Communication data type
For communication collectives like reduction, gathering and scattering operations, a separate data type is used.
All gather and scatter operations are performed in the same data type the data is in. For example, if you're training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation.
Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isn't exact. This is especially the case with bf16 which has a lower precision than fp16. For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients.
You can choose the communication data type by setting the communication_data_type parameter in the config file. For example, choosing fp32 adds a small amount of overhead, but it ensures the reduction operation is accumulated in fp32 and, when it is ready, downcast to whichever half-precision dtype you're training in.
```yaml
{
"communication_data_type": "fp32"
}
```
Deployment
DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate. To deploy, add --deepspeed ds_config.json to the [Trainer] command line. It’s recommended to use DeepSpeed’s add_config_arguments utility to add any necessary command line arguments to your code.
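A minimal sketch of the add_config_arguments pattern for your own (non-[Trainer]) script is shown below; the parser description and --local_rank argument are illustrative.

```py
# A minimal sketch of adding DeepSpeed's command line arguments to your own script.
import argparse

import deepspeed

parser = argparse.ArgumentParser(description="My training script")  # illustrative parser
parser.add_argument("--local_rank", type=int, default=-1, help="local rank passed by the launcher")
# adds --deepspeed, --deepspeed_config, and related flags
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()
```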
This guide will show you how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples.
To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you don't need to add --num_gpus. The example below uses 2 GPUs.
```bash
deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```
To deploy DeepSpeed on a single GPU, add the --num_gpus parameter. It isn't necessary to explicitly set this value if you only have 1 GPU because DeepSpeed deploys all GPUs it can see on a given node.
```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
```
DeepSpeed is still useful with just 1 GPU because you can:
Offload some computations and memory to the CPU to make more GPU resources available to your model to use a larger batch size or fit a very large model that normally won't fit.
Minimize memory fragmentation with its smart GPU memory management system, which also allows you to fit bigger models and data batches.
Set the allgather_bucket_size and reduce_bucket_size values to 2e8 in the ZeRO-2 configuration file to get better performance on a single GPU, as shown in the sketch after this list.
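A sketch of that last tweak, passing the DeepSpeed config to [Trainer] as a Python dict rather than a JSON file; only the bucket sizes come from the tip above, the rest of the dict is an illustrative minimal config.

```py
# A sketch of the single-GPU ZeRO-2 bucket size tweak (surrounding values are illustrative).
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(output_dir="output_dir", deepspeed=ds_config)
```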
Multi-node deployment
A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup, which can be launched with the deepspeed launcher. For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed with ssh hostname1 and the second node with ssh hostname2. Both nodes must be able to communicate with each other locally over ssh without a password.
By default, DeepSpeed expects your multi-node environment to use shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include the following checkpoint setting so checkpoints can be loaded without access to a shared filesystem:
```yaml
{
"checkpoint": {
"use_node_local_storage": true
}
}
```
You could also use the [Trainer]'s --save_on_each_node argument to automatically add the above checkpoint to your config.
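A minimal sketch of that, assuming you configure [Trainer] from Python; the config file name is illustrative.

```py
# A minimal sketch: save_on_each_node adds the node-local checkpoint setting shown above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    save_on_each_node=True,      # adds "checkpoint": {"use_node_local_storage": true}
    deepspeed="ds_config.json",  # illustrative config file name
)
```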
For torchrun, you have to ssh to each node and run the following command on both of them, setting --node_rank to 0 on the first node and 1 on the second. The launcher waits until both nodes are synchronized before launching the training.
```bash
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json
```
For the deepspeed launcher, start by creating a hostfile.
```
hostname1 slots=8
hostname2 slots=8
```
Then you can launch the training with the following command. The deepspeed launcher automatically launches the command on both nodes at once.
```bash
deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
your_program.py <normal cl args> --deepspeed ds_config.json
```
Check out the Resource Configuration (multi-node) guide for more details about configuring multi-node compute resources.
SLURM
In a SLURM environment, you'll need to adapt the SLURM script below to your specific cluster. An example SLURM script may look like:
```bash
#SBATCH --job-name=test-nodes        # name
#SBATCH --nodes=2                    # nodes
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=10           # number of cores per tasks
#SBATCH --gres=gpu:8                 # number of gpus
#SBATCH --time 20:00:00              # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out           # output file name
export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901
srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT \
your_program.py --deepspeed ds_config.json'
```
Then you can schedule your multi-node deployment with the following command which launches training simultaneously on all nodes.
```bash
sbatch launch.slurm
```
Notebook
The deepspeed launcher doesn't support deployment from a notebook, so you'll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment, which means you have to use the deepspeed launcher, and it can't be emulated as shown here.
DeepSpeed requires a distributed environment even when only one process is used.
```py
# This emulates a launcher in the notebook
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if "RuntimeError: Address already in use"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Now proceed as normal, plus pass the DeepSpeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
```
If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell.
```bash
%%bash
cat <<'EOT' > ds_config_zero3.json
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
EOT
```