
	Custom hardware for training
	The hardware you use to run model training and inference can have a big effect on performance. For a deep dive into GPUs, make sure to check out Tim Dettmers' excellent blog post.
	Let's have a look at some practical advice for GPU setups.
	GPU
	When you train bigger models you have essentially three options:

	bigger GPUs
	more GPUs
	more CPU and NVMe (offloaded to by DeepSpeed-Infinity)

	Let's start with the case where you have a single GPU.
	Power and Cooling
	If you bought an expensive high-end GPU, make sure you give it the correct power and sufficient cooling.
	Power:
	Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-Pin power sockets. Make sure you have as many independent 12V PCI-E 8-Pin cables plugged into the card as there are sockets. Do not use the 2 connectors split off the end of a single cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-Pin cables going from your PSU to the card, not one cable that has 2 PCI-E 8-Pin connectors at the end! Otherwise you won't get the full performance out of your card.
	Each PCI-E 8-Pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power.
	Some other cards may use PCI-E 12-Pin connectors, and these can deliver up to 500-600W of power.
	Low-end cards may use 6-Pin connectors, which supply up to 75W of power.
	Additionally, you want a high-end PSU that delivers stable voltage. Some lower quality ones may not give the card the stable voltage it needs to function at its peak.
	And of course the PSU needs to have enough unused Watts to power the card.
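The connector ratings above lend themselves to a quick sanity check. The following is a minimal sketch, assuming only the per-connector limits quoted in this section (150W per 8-Pin cable, 75W per 6-Pin cable). The function names `cable_power_budget` and `psu_has_headroom` are hypothetical, and the wattages passed to them are illustrative inputs, not measurements.

```bash
# Hypothetical helper: upper bound (in watts) on what the PCI-E cables can
# deliver, using the limits quoted above (8-Pin: 150W, 6-Pin: 75W).
# Usage: cable_power_budget <num_8pin_cables> <num_6pin_cables>
cable_power_budget() {
    echo $(( ${1:-0} * 150 + ${2:-0} * 75 ))
}

# Hypothetical helper: succeeds if the PSU still has enough unused watts.
# Usage: psu_has_headroom <psu_watts> <watts_used_by_rest_of_system> <gpu_watts>
psu_has_headroom() {
    [ $(( $1 - $2 )) -ge "$3" ]
}

# A card with two 8-Pin sockets fed by two independent cables:
cable_power_budget 2 0   # prints 300
```

A pigtail cable does not raise this budget: both of its connectors share a single cable back to the PSU, so it counts as one cable here.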
	Cooling:
	When a GPU overheats, it will start throttling down and will not deliver full performance. It can even shut down if it gets too hot.
	It's hard to say what the exact best temperature is for a heavily loaded GPU, but anything under 80°C is probably good, and lower is better; 70-75°C is an excellent range to be in. Throttling is likely to start at around 84-90°C. Beyond throttling performance, prolonged very high temperatures are likely to reduce the lifespan of a GPU.
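To keep an eye on these ranges during a run, something like the sketch below may help. It assumes `nvidia-smi` is on the `PATH` and uses its `--query-gpu` CSV output. The helper name `classify_gpu_temp` and its labels are hypothetical, and the thresholds are only the rules of thumb from this section, not vendor specifications.

```bash
# Hypothetical helper: classify a GPU temperature (integer, degrees Celsius)
# using the rough ranges discussed above.
classify_gpu_temp() {
    case "$1" in
        ''|*[!0-9]*) echo "unknown"; return ;;   # e.g. "N/A" or empty input
    esac
    if   [ "$1" -le 75 ]; then echo "excellent"
    elif [ "$1" -lt 80 ]; then echo "good"
    elif [ "$1" -lt 84 ]; then echo "warm"
    else                       echo "throttling-likely"
    fi
}

# Report every GPU on this node, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
    while IFS=', ' read -r idx temp; do
        echo "GPU ${idx}: ${temp}C ($(classify_gpu_temp "$temp"))"
    done
fi
```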
	Next let's have a look at one of the most important aspects when having multiple GPUs: connectivity.
	Multi-GPU Connectivity
	If you use multiple GPUs the way cards are inter-connected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run:

	nvidia-smi topo -m
	and it will tell you how the GPUs are inter-connected. On a machine with two GPUs connected with NVLink, you will most likely see something like:
	        GPU0    GPU1    CPU Affinity    NUMA Affinity
	GPU0     X      NV2     0-23            N/A
	GPU1    NV2      X      0-23            N/A
	On a different machine without NVLink, you may see:
	        GPU0    GPU1    CPU Affinity    NUMA Affinity
	GPU0     X      PHB     0-11            N/A
	GPU1    PHB      X      0-11            N/A
	The report includes this legend:
	X = Self
	SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
	NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
	PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
	PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
	PIX = Connection traversing at most a single PCIe bridge
	NV# = Connection traversing a bonded set of # NVLinks
	So in the first report, NV2 tells us the GPUs are interconnected with 2 NVLinks, and in the second report, PHB tells us we have a typical consumer-level PCIe+Bridge setup.
	Check what type of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB).
	Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes super important to achieve faster training.
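If you want to check the connectivity from a script, the topology matrix can be parsed. The sketch below is a rough illustration that assumes the two-GPU layout shown in the reports above, where the `GPU0` row has `X` in its first column. The helper names `gpu0_gpu1_link` and `describe_link` are hypothetical. Real output can differ between driver versions (extra columns, terminal formatting codes), so treat the parsing as an assumption to verify on your machine.

```bash
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and print the
# link type between GPU0 and GPU1 (the entry after "X" on the GPU0 row).
gpu0_gpu1_link() {
    awk '$1 == "GPU0" && $2 == "X" { print $3; exit }'
}

# Hypothetical helper: map a legend code to a short description.
describe_link() {
    case "$1" in
        NV*)                  echo "NVLink (${1#NV} bonded links)" ;;
        PIX|PXB|PHB|NODE|SYS) echo "PCIe ($1)" ;;
        *)                    echo "unknown" ;;
    esac
}

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    describe_link "$(nvidia-smi topo -m | gpu0_gpu1_link)"
fi
```

For the first report above this would print `NVLink (2 bonded links)`, and for the second `PCIe (PHB)`.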
	NVLink
	NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia.
	Each new generation provides higher bandwidth, e.g. here is a quote from Nvidia Ampere GA102 GPU Architecture:

	Third-Generation NVLink®
	GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links,
	with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four
	links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth
	between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink.
	(Note that 3-Way and 4-Way SLI configurations are not supported.)

	So the higher the X in the NVX entries reported by nvidia-smi topo -m, the better. The generation will depend on your GPU architecture.
	Let's compare training an openai-community/gpt2 language model over a small sample of wikitext, with and without NVLink.
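The arithmetic in the quote can be reproduced directly, which is handy for estimating the aggregate bandwidth of a different link count. This is a small sketch. The helper name `nvlink_bandwidth` is hypothetical, and the per-link figure passed to it is the GA102 number quoted above, so other GPU generations will need their own value.

```bash
# Hypothetical helper: aggregate NVLink bandwidth between two GPUs, in GB/sec.
# Usage: nvlink_bandwidth <num_links> <GB/sec per link, per direction>
# Prints "<per-direction total> <bidirectional total>".
nvlink_bandwidth() {
    LC_ALL=C awk -v n="$1" -v bw="$2" \
        'BEGIN { printf "%.2f %.2f\n", n * bw, 2 * n * bw }'
}

# Four links at 14.0625 GB/sec each, as in the GA102 quote:
nvlink_bandwidth 4 14.0625   # prints "56.25 112.50"
```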
	The results are:
	| NVLink | Time |
	| ----- | ---: |
	| Y | 101s |
	| N | 131s |
	You can see that training completes ~23% faster with NVLink (101s vs. 131s). In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink.
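The ~23% figure is the reduction in wall-clock time relative to the run without NVLink. A minimal sketch of that calculation follows. The helper name `time_reduction_pct` is hypothetical.

```bash
# Hypothetical helper: percentage of wall-clock time saved relative to the
# slower baseline run.
# Usage: time_reduction_pct <baseline_seconds> <faster_seconds>
time_reduction_pct() {
    LC_ALL=C awk -v base="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (base - fast) / base * 100 }'
}

# Rounded times from the table:
time_reduction_pct 131 101   # prints 22.9
```

With the unrounded `train_runtime` values from the benchmark output (131.4367s and 101.9003s), the same calculation gives 22.5.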
	Here is the full benchmark code and outputs:
	```bash
	# DDP w/ NVLink
	rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \
	--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
	--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
	--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
	# Output:
	# {'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

	# DDP w/o NVLink (NCCL_P2P_DISABLE=1 disables peer-to-peer transfers)
	rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \
	--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \
	--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
	--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
	# Output:
	# {'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
	```

	Hardware: 2x TITAN RTX (24GB each), connected with 2 NVLinks (NV2 in nvidia-smi topo -m)
	Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0