File size: 16,488 Bytes
939262b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false

}
EOT

If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. For example, to launch run_translation.py:
py
!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py 
You could also use %%bash magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. With %%bash magic, you don't need to emulate a distributed environment.

%%bash
git clone https://github.com/huggingface/transformers
cd transformers
deepspeed examples/pytorch/translation/run_translation.py 

Save model weights
DeepSpeed stores the main full precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt) and are saved under the normal checkpoint.

A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set "stage3_gather_16bit_weights_on_model_save": true because the model weights are partitioned across multiple GPUs. Otherwise, the [Trainer] won't save the weights in fp16 and it won't create a pytorch_model.bin file. This is because DeepSpeed's state_dict contains a placeholder instead of the real weights and you won't be able to load them.
yaml
{
    "zero_optimization": {
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

The full precision weights shouldn't be saved during training because it can require a lot of memory. It is usually best to save the fp32 weights offline after training is complete. But if you have a lot of free CPU memory, it is possible to save the fp32 weights during training. This section covers both online and offline approaches.
Online
You must have saved at least one checkpoint to load the latest checkpoint as shown in the following:

from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

If you've enabled the --load_best_model_at_end parameter to track the best checkpoint in [TrainingArguments], you can finish training first and save the final model explicitly. Then you can reload it as shown below:

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

Once load_state_dict_from_zero_checkpoint is run, the model is no longer usable in DeepSpeed in the context of the same application. You'll need to initialize the DeepSpeed engine again since model.load_state_dict(state_dict) removes all the DeepSpeed magic from it. Only use this at the very end of training.

You can also extract and load the state_dict of the fp32 weights:

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)

Offline
DeepSpeed provides a zero_to_fp32.py script at the top-level of the checkpoint folder for extracting weights at any point. This is a standalone script and you don't need a configuration file or [Trainer].
For example, if your checkpoint folder looked like this:

$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
To reconstruct the fp32 weights from the DeepSpeed checkpoint (ZeRO-2 or ZeRO-3) subfolder global_step1, run the following command to create and consolidate the full fp32 weights from multiple GPUs into a single pytorch_model.bin file. The script automatically discovers the subfolder containing the checkpoint.
py
python zero_to_fp32.py . pytorch_model.bin

Run python zero_to_fp32.py -h for more usage details. The script requires 2x the general RAM of the final fp32 weights.

ZeRO Inference
ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU which makes it possible to run inference with huge models on a GPU. Inference doesn't require any large additional amounts of memory for the optimizer states and gradients so you can fit much larger batches and/or sequence lengths on the same hardware.
ZeRO Inference shares the same configuration file as ZeRO-3, and ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefits for inference.
To run ZeRO Inference, pass your usual training arguments to the [TrainingArguments] class and add the --do_eval argument.

deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
Non-Trainer DeepSpeed integration
DeepSpeed also works with Transformers without the [Trainer] class. This is handled by the [HfDeepSpeedConfig] which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call [~PreTrainedModel.from_pretrained].

If you want everything automatically taken care of for you, try using DeepSpeed with the [Trainer]! You'll need to follow the DeepSpeed documentation, and manually configure the parameter values in the config file (you can't use the "auto" value).

To efficiently deploy ZeRO-3, you must instantiate the [HfDeepSpeedConfig] object before the model and keep that object alive:

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed
ds_config = {}  # deepspeed config object or path to the file
must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
model = AutoModel.from_pretrained("openai-community/gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, )

[HfDeepSpeedConfig] is not required for ZeRO-1 or ZeRO-2.

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel, AutoConfig
import deepspeed
ds_config = {}  # deepspeed config object or path to the file
must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
config = AutoConfig.from_pretrained("openai-community/gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, )

Non-Trainer ZeRO Inference
To run ZeRO Inference without the [Trainer] in cases where you can’t fit a model onto a single GPU, try using additional GPUs or/and offloading to CPU memory. The important nuance to understand here is that the way ZeRO is designed, you can process different inputs on different GPUs in parallel.
Make sure to:

disable CPU offload if you have enough GPU memory (since it slows things down).
enable bf16 if you have an Ampere or newer GPU to make things faster. If you don’t have one of these GPUs, you may enable fp16 as long as you don’t use a model pretrained in bf16 (T5 models) because it may lead to an overflow error.

Take a look at the following script to get a better idea of how to run ZeRO Inference without the [Trainer] on a model that won't fit on a single GPU.

!/usr/bin/env python
This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
into a single GPU

1. Use 1 GPU with CPU offload
2. Or use multiple GPUs instead

First you need to install deepspeed: pip install deepspeed

Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
small GPUs can handle it. or 1 small GPU and a lot of CPU memory.

To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
process multiple inputs at once.

The provided deepspeed config also activates CPU memory offloading, so chances are that if you
have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
model that doesn't normally fit into a single GPU. If you have enough GPU memory the program will
run faster if you don't want offload to CPU - so disable that section then.

To deploy on 1 gpu:

deepspeed --num_gpus 1 t0.py
or:
python -m torch.distributed.run --nproc_per_node=1 t0.py

To deploy on 2 gpus:

deepspeed --num_gpus 2 t0.py
or:
python -m torch.distributed.run --nproc_per_node=2 t0.py
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig
import deepspeed
import os
import torch
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers
distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()
model_name = "bigscience/T0_3B"
config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model
batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size
ds_config notes

- enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
faster.

- for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
all official t5 models are bf16-pretrained

- set offload_param.device to "none" or completely remove the offload_param section if you don't
- want CPU offload

- if using offload_param you can manually finetune stage3_param_persistence_threshold to control
- which params should remain on gpus - the larger the value the smaller the offload size

For in-depth info on Deepspeed config see
https://huggingface.co./docs/transformers/main/main_classes/deepspeed
keeping the same format as json for consistency, except it uses lower case for true/false
fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
fmt: on
next line instructs transformers to partition the model directly over multiple gpus using
deepspeed.zero.Init when model's from_pretrained method is called.

it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)

otherwise the model will first be loaded normally and only partitioned at forward time which is
less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference
Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
If you use more GPUs adjust for more.
And of course if you have just one input to process you then need to pass the same string to both gpus
If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Save the script as t0.py and launch it:

$ deepspeed --num_gpus 2 t0.py
rank0:
   in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
  out=Positive
rank1:
   in=Is this review positive or negative? Review: this is the worst restaurant ever
  out=negative
This is a very basic example and you'll want to adapt it to your use case.
Generate
Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the [~GenerationMixin.generate] method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven't received the weight shard from the GPU that finished first.
For Transformers>=4.28, if synced_gpus is automatically set to True if multiple GPUs are detected during generation.
Troubleshoot
When you encounter an issue, you should consider whether DeepSpeed is the cause of the problem because often it isn't (unless it's super obviously and you can see DeepSpeed modules in the exception)! The first step should be to retry your setup without DeepSpeed, and if the problem persists, then you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an Issue on the DeepSpeed repository.
For issues related to the Transformers integration, please provide the following information:

the full DeepSpeed config file

the command line arguments of the [Trainer], or [TrainingArguments] arguments if you're scripting the [Trainer] setup yourself (don't dump the [TrainingArguments] which has dozens of irrelevant entries)

the outputs of:

python -c 'import torch; print(f"torch: {torch.__version__}")'
python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'

a link to a Google Colab notebook to reproduce the issue

if impossible, a standard and non-custom dataset we can use and also try to use an existing example to reproduce the issue with

The following sections provide a guide for resolving two of the most common issues.
DeepSpeed process killed at startup
When the DeepSpeed process is killed during launch without a traceback, that usually means the program tried to allocate more CPU memory than your system has or your process tried to allocate more CPU memory than allowed leading the OS kernel to terminate the process. In this case, check whether your configuration file has either offload_optimizer, offload_param or both configured to offload to the CPU. 
If you have NVMe and ZeRO-3 setup, experiment with offloading to the NVMe (estimate the memory requirements for your model).
NaN loss
NaN loss often occurs when a model is pretrained in bf16 and then you try to use it with fp16 (especially relevant for TPU trained models). To resolve this, use fp32 or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).
The other issue may be related to using fp16. For example, if this is your fp16 configuration:
yaml
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
You might see the following OVERFLOW! messages in the logs: