"gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false } EOT If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. For example, to launch run_translation.py: py !git clone https://github.com/huggingface/transformers !cd transformers; deepspeed examples/pytorch/translation/run_translation.py You could also use %%bash magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. With %%bash magic, you don't need to emulate a distributed environment. %%bash git clone https://github.com/huggingface/transformers cd transformers deepspeed examples/pytorch/translation/run_translation.py Save model weights DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt) and are saved under the normal checkpoint. A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set "stage3_gather_16bit_weights_on_model_save": true because the model weights are partitioned across multiple GPUs. Otherwise, the [Trainer] won't save the weights in fp16 and it won't create a pytorch_model.bin file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights and you won't be able to load them. yaml { "zero_optimization": { "stage3_gather_16bit_weights_on_model_save": true } } The full precision weights shouldn't be saved during training because it can require a lot of memory. It is usually best to save the fp32 weights offline after training is complete. But if you have a lot of free CPU memory, it is possible to save the fp32 weights during training. This section covers both online and offline approaches. Online You must have saved at least one checkpoint to load the latest checkpoint as shown in the following: from transformers.trainer_utils import get_last_checkpoint from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) If you've enabled the --load_best_model_at_end parameter to track the best checkpoint in [TrainingArguments], you can finish training first and save the final model explicitly. Then you can reload it as shown below: from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") trainer.deepspeed.save_checkpoint(checkpoint_dir) fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) Once load_state_dict_from_zero_checkpoint is run, the model is no longer usable in DeepSpeed in the context of the same application. You'll need to initialize the DeepSpeed engine again since model.load_state_dict(state_dict) removes all the DeepSpeed magic from it. Only use this at the very end of training. 
You can also extract and load the state_dict of the fp32 weights:

```py
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
```

### Offline

DeepSpeed provides a `zero_to_fp32.py` script at the top-level of the checkpoint folder for extracting weights at any point. This is a standalone script and you don't need a configuration file or [Trainer].

For example, if your checkpoint folder looked like this:

```bash
$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
```

To reconstruct the fp32 weights from the DeepSpeed checkpoint (ZeRO-2 or ZeRO-3) subfolder `global_step1`, run the following command to create and consolidate the full fp32 weights from multiple GPUs into a single pytorch_model.bin file. The script automatically discovers the subfolder containing the checkpoint.

```py
python zero_to_fp32.py . pytorch_model.bin
```

Run `python zero_to_fp32.py -h` for more usage details. The script requires ~2x the general RAM of the final fp32 weights.

## ZeRO Inference

ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU, which makes it possible to run inference with huge models on a GPU. Inference doesn't require any large additional amounts of memory for the optimizer states and gradients, so you can fit much larger batches and/or sequence lengths on the same hardware.

ZeRO Inference shares the same configuration file as ZeRO-3; ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefits for inference.

To run ZeRO Inference, pass your usual training arguments to the [TrainingArguments] class and add the --do_eval argument.

```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
```

## Non-Trainer DeepSpeed integration

DeepSpeed also works with Transformers without the [Trainer] class. This is handled by the [HfDeepSpeedConfig], which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call [~PreTrainedModel.from_pretrained].

If you want everything automatically taken care of for you, try using DeepSpeed with the [Trainer]! Otherwise, you'll need to follow the DeepSpeed documentation and manually configure the parameter values in the config file (you can't use the "auto" value).
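To illustrate what "no auto values" means in practice, here is a minimal ZeRO-3 config sketch built as a Python dict. The bucket sizes and batch sizes below are placeholder numbers, not tuned recommendations:

```py
# Illustrative ZeRO-3 config for use without the Trainer.
# Every value is explicit because "auto" only works when the Trainer fills it in.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": 1_000_000,               # placeholder, tune for your model
        "stage3_prefetch_bucket_size": 1_000_000,      # placeholder, tune for your model
        "stage3_param_persistence_threshold": 10_000,  # placeholder, tune for your model
    },
    # train_batch_size must equal micro_batch * gradient_accumulation_steps * world_size
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "train_batch_size": 8,  # 1 * 8 * 1 GPU
    "steps_per_print": 2000,
    "wall_clock_breakdown": False,
}
```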
To efficiently deploy ZeRO-3, you must instantiate the [HfDeepSpeedConfig] object before the model and keep that object alive:

```py
from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed

ds_config = {...}  # deepspeed config object or path to the file

# must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = AutoModel.from_pretrained("openai-community/gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```

[HfDeepSpeedConfig] is not required for ZeRO-1 or ZeRO-2.

```py
from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel, AutoConfig
import deepspeed

ds_config = {...}  # deepspeed config object or path to the file

# must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

config = AutoConfig.from_pretrained("openai-community/gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
```

### Non-Trainer ZeRO Inference

To run ZeRO Inference without the [Trainer] in cases where you can't fit a model onto a single GPU, try using additional GPUs and/or offloading to CPU memory. The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel.

Make sure to:

- disable CPU offload if you have enough GPU memory (since it slows things down).
- enable bf16 if you have an Ampere or newer GPU to make things faster. If you don't have one of these GPUs, you may enable fp16 as long as you don't use a model pretrained in bf16 (T5 models) because it may lead to an overflow error.

Take a look at the following script to get a better idea of how to run ZeRO Inference without the [Trainer] on a model that won't fit on a single GPU.

```py
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it, or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 GPUs. And then you can adapt the script to handle more GPUs if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory, the program
# will run faster without CPU offload - so disable that section then.
#
# To deploy on 1 GPU:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 GPUs:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0_3B"

config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use an Ampere or higher GPU - this will run in mixed precision and will be
#   faster.
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
#   all official t5 models are bf16-pretrained
# - set offload_param.device to "none" or completely remove the offload_param section if you
#   don't want CPU offload
# - if using offload_param you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on GPUs - the larger the value the smaller the offload size
#
# For in-depth info on Deepspeed config see
# https://huggingface.co./docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except Python booleans are True/False
# instead of lowercase true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# the next line instructs transformers to partition the model directly over multiple GPUs using
# deepspeed.zero.Init when the model's from_pretrained method is called.
#
# it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)
#
# otherwise the model will first be loaded normally and only partitioned at forward time, which is
# less efficient and, when there is little CPU RAM, may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 GPUs you process 2 inputs at once.
# If you use more GPUs, adjust for more.
#
# And of course if you have just one input to process you then need to pass the same string to both GPUs.
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
```

Save the script as t0.py and launch it:

```bash
$ deepspeed --num_gpus 2 t0.py
rank0:
   in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
  out=Positive
rank1:
   in=Is this review positive or negative? Review: this is the worst restaurant ever
  out=negative
```

This is a very basic example and you'll want to adapt it to your use case.

### Generate

Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the [~GenerationMixin.generate] method. Otherwise, if one GPU finishes generating before the others, the whole system hangs because the remaining GPUs haven't received the weight shard from the GPU that finished first.

For Transformers>=4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.

## Troubleshoot

When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isn't (unless it's super obvious and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the DeepSpeed repository.

For issues related to the Transformers integration, please provide the following information:

- the full DeepSpeed config file
- the command line arguments of the [Trainer], or the [TrainingArguments] arguments if you're scripting the [Trainer] setup yourself (don't dump the [TrainingArguments], which has dozens of irrelevant entries)
- the outputs of:

```bash
python -c 'import torch; print(f"torch: {torch.__version__}")'
python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
```

- a link to a Google Colab notebook to reproduce the issue
- if that is impossible, a standard and non-custom dataset we can use, and also try to use an existing example to reproduce the issue with

The following sections provide a guide for resolving two of the most common issues.

### DeepSpeed process killed at startup

When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has, or your process tried to allocate more CPU memory than allowed, leading the OS kernel to terminate the process. In this case, check whether your configuration file has either offload_optimizer, offload_param, or both configured to offload to the CPU.

If you have an NVMe drive and a ZeRO-3 setup, experiment with offloading to the NVMe (estimate the memory requirements for your model first).

### NaN loss

NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU-trained models). To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).
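If you're training with the [Trainer], a minimal sketch of switching from fp16 to bf16 mixed precision could look like the following (the output directory and config path are placeholder names; bf16 requires hardware with native bf16 support):

```py
# Sketch: prefer bf16 over fp16 for bf16-pretrained models (e.g. T5) to avoid NaN loss.
# "output_dir" and "ds_config.json" are placeholder names.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    bf16=True,                   # instead of fp16=True; needs Ampere+ GPUs or a TPU
    deepspeed="ds_config.json",
)
```

With `"bf16": {"enabled": "auto"}` in the DeepSpeed config, the [Trainer] fills the value in from these arguments.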
The other issue may be related to using fp16. For example, if this is your fp16 configuration:

```yaml
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```

You might see the following OVERFLOW! messages in the logs: