"scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true, "buffer_count": 4, "fast_init": false }, "offload_param": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true, "buffer_count": 5, "buffer_size": 1e8, "max_in_cpu": 1e9 }, "aio": { "block_size": 262144, "queue_depth": 32, "thread_count": 1, "single_submit": false, "overlap_events": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } DeepSpeed features There are a number of important parameters to specify in the DeepSpeed configuration file which are briefly described in this section. Activation/gradient checkpointing Activation and gradient checkpointing trades speed for more GPU memory which allows you to overcome scenarios where your GPU is out of memory or to increase your batch size for better performance. To enable this feature: For a Hugging Face model, set model.gradient_checkpointing_enable() or --gradient_checkpointing in the [Trainer]. For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API. You could also replace the Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to the CPU memory instead of recalculating them. Optimizer and scheduler DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you don't enable offload_optimizer. When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation. The optimizer and scheduler parameters for the config file can be set from the command line to avoid hard to find errors. For example, if the learning rate is set to a different value in another place you can override it from the command line. Aside from the optimizer and scheduler parameters, you'll need to ensure your [Trainer] command line arguments match the DeepSpeed configuration. DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB) but you can also import other optimizers from PyTorch. If you don't configure the optimizer in the config, the [Trainer] automatically selects AdamW and either uses the supplied values or the default values for the following parameters from the command line: lr, adam_beta1, adam_beta2, adam_epsilon, weight_decay. You can set the parameters to "auto" or manually input your own desired values. yaml { "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } } } You can also use an unsupported optimizer by adding the following to the top level configuration. yaml { "zero_allow_untested_optimizer": true } From DeepSpeed==0.8.3 on, if you want to use offload, you'll also need to the following to the top level configuration because offload works best with DeepSpeed's CPU Adam optimizer. 
yaml { "zero_force_ds_cpu_optimizer": false } DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers. Transformers and DeepSpeed provide two of the same schedulers: WarmupLR is the same as --lr_scheduler_type constant_with_warmup in Transformers WarmupDecayLR is the same as --lr_scheduler_type linear in Transformers (this is the default scheduler used in Transformers) If you don't configure the scheduler in the config, the [Trainer] automatically selects WarmupDecayLR and either uses the supplied values or the default values for the following parameters from the command line: warmup_min_lr, warmup_max_lr, warmup_num_steps, total_num_steps (automatically calculated during run time if max_steps is not provided). You can set the parameters to "auto" or manually input your own desired values. yaml { "scheduler": { "type": "WarmupDecayLR", "params": { "total_num_steps": "auto", "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } } } Precision Deepspeed supports fp32, fp16, and bf16 mixed precision. If your model doesn't work well with mixed precision, for example if it wasn't pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode. yaml { "fp16": { "enabled": false } } For Ampere GPUs and PyTorch > 1.7, it automatically switches to the more efficient tf32 format for some operations but the results are still in fp32. You can control it from the [Trainer] by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it. To configure PyTorch AMP-like fp16 mixed precision reduces memory usage and accelerates training speed. [Trainer] automatically enables or disables fp16 based on the value of args.fp16_backend, and the rest of the config can be set by you. fp16 is enabled from the command line when the following arguments are passed: --fp16, --fp16_backend amp or --fp16_full_eval. yaml { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 } } For additional DeepSpeed fp16 training options, take a look at the FP16 Training Options reference. To configure Apex-like fp16 mixed precision, setup the config as shown below with "auto" or your own values. [Trainer] automatically configure amp based on the values of args.fp16_backend and args.fp16_opt_level. It can also be enabled from the command line when the following arguments are passed: --fp16, --fp16_backend apex or --fp16_opt_level 01. yaml { "amp": { "enabled": "auto", "opt_level": "auto" } } To use bf16, you'll need at least DeepSpeed==0.6.0. bf16 has the same dynamic range as fp32 and doesn’t require loss scaling. However, if you use gradient accumulation with bf16, gradients are accumulated in bf16 which may not be desired because this format's low precision can lead to lossy accumulation. bf16 can be setup in the config file or enabled from the command line when the following arguments are passed: --bf16 or --bf16_full_eval. yaml { "bf16": { "enabled": "auto" } } Batch size The batch size can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps. 
yaml { "train_micro_batch_size_per_gpu": "auto", "train_batch_size": "auto" } Gradient accumulation Gradient accumulation can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets it to the value of args.gradient_accumulation_steps. ```yaml { "gradient_accumulation_steps": "auto" } Gradient clipping Gradient clipping can be auto-configured or explicitly set. If you choose to use the "auto" option, [Trainer] sets it to the value of args.max_grad_norm. yaml { "gradient_clipping": "auto" } Communication data type For communication collectives like reduction, gathering and scattering operations, a separate data type is used. All gather and scatter operations are performed in the same data type the data is in. For example, if you're training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation. Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision isn't exact. This is especially the case with bf16 which has a lower precision than fp16. For this reason, fp16 is the default for reduction operations because the loss is minimal when averaging gradients. You can choose the communication data type by setting the communication_data_type parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction operation is accumulated in fp32 and when it is ready, it is downcasted to whichever half-precision dtype you're training in. yaml { "communication_data_type": "fp32" } Deployment DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate. To deploy, add --deepspeed ds_config.json to the [Trainer] command line. It’s recommended to use DeepSpeed’s add_config_arguments utility to add any necessary command line arguments to your code. This guide will show you how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples. To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you don't need to add --num_gpus. The example below uses 2 GPUs. deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \ --deepspeed tests/deepspeed/ds_config_zero3.json \ --model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ --output_dir output_dir --overwrite_output_dir --fp16 \ --do_train --max_train_samples 500 --num_train_epochs 1 \ --dataset_name wmt16 --dataset_config "ro-en" \ --source_lang en --target_lang ro To deploy DeepSpeed on a single GPU, add the --num_gpus parameter. It isn't necessary to explicitly set this value if you only have 1 GPU because DeepSpeed deploys all GPUs it can see on a given node. deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ --deepspeed tests/deepspeed/ds_config_zero2.json \ --model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ --output_dir output_dir --overwrite_output_dir --fp16 \ --do_train --max_train_samples 500 --num_train_epochs 1 \ --dataset_name wmt16 --dataset_config "ro-en" \ --source_lang en --target_lang ro DeepSpeed is still useful with just 1 GPU because you can: Offload some computations and memory to the CPU to make more GPU resources available to your model to use a larger batch size or fit a very large model that normally won't fit. 
### Multi-node deployment

A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup, which can be launched with the `deepspeed` launcher. For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed with `ssh hostname1` and the second node with `ssh hostname2`. Both nodes must be able to communicate with each other locally over ssh without a password.

By default, DeepSpeed expects your multi-node environment to use shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a checkpoint section to allow loading without access to a shared filesystem:

```yaml
{
  "checkpoint": {
    "use_node_local_storage": true
  }
}
```

You could also use the [`Trainer`]'s `--save_on_each_node` argument to automatically add the above checkpoint section to your config (see the sketch at the end of this section).

For torchrun, you have to ssh to each node and run the following command on both of them. The launcher waits until both nodes are synchronized before launching the training.

```bash
torchrun --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json
```

For the `deepspeed` launcher, start by creating a hostfile.

```
hostname1 slots=8
hostname2 slots=8
```

Then you can launch the training with the following command. The `deepspeed` launcher automatically launches the command on both nodes at once.

```bash
deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
your_program.py <normal cl args> --deepspeed ds_config.json
```

Check out the Resource Configuration (multi-node) guide for more details about configuring multi-node compute resources.
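As mentioned above, rather than editing the checkpoint section by hand, you can let [`Trainer`] add it for you; a minimal sketch (the output path is a placeholder):

```py
from transformers import TrainingArguments

# Sketch only: save_on_each_node saves checkpoints on every node, which adds
# {"checkpoint": {"use_node_local_storage": true}} to the DeepSpeed config.
training_args = TrainingArguments(
    output_dir="output_dir",      # placeholder path
    deepspeed="ds_config.json",   # your existing DeepSpeed config
    save_on_each_node=True,
)
```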
### SLURM

In a SLURM environment, you'll need to adapt your SLURM script to your specific SLURM environment. An example SLURM script may look like:

```bash
#SBATCH --job-name=test-nodes        # name
#SBATCH --nodes=2                    # nodes
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=10           # number of cores per tasks
#SBATCH --gres=gpu:8                 # number of gpus
#SBATCH --time 20:00:00              # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out           # output file name

export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901

srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
 --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
 --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
your_program.py --deepspeed ds_config.json'
```

Then you can schedule your multi-node deployment with the following command, which launches training simultaneously on all nodes.

```bash
sbatch launch.slurm
```

### Notebook

The `deepspeed` launcher doesn't support deployment from a notebook, so you'll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. This means you have to use the `deepspeed` launcher, which can't be emulated as shown here.

```py
from transformers import Trainer, TrainingArguments

# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Now proceed as normal, plus pass the DeepSpeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
```

If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell.

```bash
%%bash
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT
```
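Alternatively, a sketch of doing the same thing in pure Python, which may be more convenient in a notebook than a bash heredoc; the dict below is an abbreviated version of the config above, so extend it with whatever options you need.

```py
import json

# Sketch only: write an abbreviated version of the ZeRO-3 config above from a Python dict.
ds_config = {
    "fp16": {"enabled": "auto"},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto"}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# Same file name as the cell above, created in the current directory.
with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=4)
```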