|
252 | 252 | "\n",
|
253 | 253 | "Initially, we want to give the current weights of the model to every GPU that we are using. To do so, we will **broadcast** the necessary tensors.\n",
|
254 | 254 | "\n",
|
255 |
| - "Then, each GPU will collect a subset of the full batch, lets say only 64 out of 256 samples, from memory and perform a forward pass of the model. At the end, we need to compute the loss over the entire batch of 256 samples, but no GPU can fit all of these. Here, the **reduction** primitive comes to the resque. The tensors that reside in different GPUs are collected and an operation is performed that will *reduce* the tensors to a single one. This allows for the result of the operation to still fit in memory. We may want to keep thisresult in a single GPU (using **reduce**) or send it to all of them (using **all_reduce**).\n", |
| 255 | + "Then, each GPU will collect a subset of the full batch, lets say only 64 out of 256 samples, from memory and perform a forward pass of the model. At the end, we need to compute the loss over the entire batch of 256 samples, but no GPU can fit all of these. Here, the **reduction** primitive comes to the rescue. The tensors that reside in different GPUs are collected and an operation is performed that will *reduce* the tensors to a single one. This allows for the result of the operation to still fit in memory. We may want to keep this result in a single GPU (using **reduce**) or send it to all of them (using **all_reduce**).\n", |
256 | 256 | "\n",
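| + "As a minimal sketch (assuming the process group has already been initialized and each rank has computed the loss of its own 64-sample shard), the two reductions could look like this:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank = dist.get_rank()\n",
| + "# stand-in for the loss computed on this rank's 64-sample shard\n",
| + "local_loss = torch.rand(1, device=f'cuda:{rank}')\n",
| + "\n",
| + "# all_reduce: every rank ends up with the sum of all local losses\n",
| + "dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)\n",
| + "\n",
| + "# reduce: only the destination rank (rank 0) is guaranteed to hold the summed result\n",
| + "# dist.reduce(local_loss, dst=0, op=dist.ReduceOp.SUM)\n",
| + "```\n",
| + "\n",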
|
257 | 257 | "The operations that we can perform are determined by the backend that we are currently using. When using `nccl`, the list of available operations is the following:\n",
|
258 | 258 | " - `SUM`\n",
|
|
389 | 389 | "source": [
|
390 | 390 | "### All Gather\n",
|
391 | 391 | "\n",
|
392 |
| - "The **all gather** operation allows for all GPUs to have access to all the data processed by the others. This can be expecially useful when different operations need to be performed by each GPU, after a common operation has been performed on each subset of the data. It is important to note that the entirety of the data needs to fit in a single GPU, so here the bottleneck won't be the memory, instead, it will be the processing speed. \n", |
| 392 | + "The **all gather** operation allows for all GPUs to have access to all the data processed by the others. This can be especially useful when different operations need to be performed by each GPU, after a common operation has been performed on each subset of the data. It is important to note that the entirety of the data needs to fit in a single GPU, so here the bottleneck won't be the memory, instead, it will be the processing speed. \n", |
393 | 393 | "\n",
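| + "In a minimal sketch (again assuming an initialized process group), the target of the collective is a list holding one empty tensor per rank:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank, world_size = dist.get_rank(), dist.get_world_size()\n",
| + "# each rank contributes its own tensor...\n",
| + "local = torch.full((2, 2), float(rank), device=f'cuda:{rank}')\n",
| + "# ...and receives a copy of every rank's tensor in this list\n",
| + "gathered = [torch.empty_like(local) for _ in range(world_size)]\n",
| + "dist.all_gather(gathered, local)\n",
| + "```\n",
| + "\n",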
|
394 | 394 | "**Place a disk in the rank 0 device and a square in the rank 1 device, then create a list of empty tensors and use it as target for the all gather operation.**\n",
|
395 | 395 | "\n",
|
|
448 | 448 | "The first thing we do is to spawn the two processes. \n",
|
449 | 449 | "In each, we begin by initializing the distributed processing environment.\n",
|
450 | 450 | "\n",
|
451 |
| - "Then, the datasets needs to be downloaded. Here, I assume that it has not been downloaded yet, and I only let the GPU in rank 0 perform this operation. This avoids having two processes writing in the same file. In order to have the other process wait for the first one to download, a **barrier** is used. The working principle is very simple, when a barrier is reached in the code, the process waits for all other processes to also reach that point in the code. Here we see how this can be a very useful construct in parallel computing, all processes require the dataset to be downloaded before proceding, so one of them starts the download, and all wait until it's done.\n", |
| 451 | + "Then, the datasets needs to be downloaded. Here, I assume that it has not been downloaded yet, and I only let the GPU in rank 0 perform this operation. This avoids having two processes writing in the same file. In order to have the other process wait for the first one to download, a **barrier** is used. The working principle is very simple, when a barrier is reached in the code, the process waits for all other processes to also reach that point in the code. Here we see how this can be a very useful construct in parallel computing, all processes require the dataset to be downloaded before proceeding, so one of them starts the download, and all wait until it's done.\n", |
452 | 452 | "\n",
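| + "In a sketch, with `download_dataset` standing in for whatever routine actually fetches the data, the pattern could look like this:\n",
| + "\n",
| + "```python\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "if dist.get_rank() == 0:\n",
| + "    download_dataset()  # hypothetical stand-in for the actual download routine\n",
| + "dist.barrier()  # every process waits here until all processes have arrived\n",
| + "# from this point on, all ranks can safely read the dataset from disk\n",
| + "```\n",
| + "\n",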
|
453 | 453 | "Then we initialize the weights, only in the rank 0 GPU, and **broadcast** them to all other GPUs. This broadcast operation is performed asynchronously, to allow for the rank 0 GPU to start loading images before the rank 1 has received the weights. This operation is akin to what DataParallel does, which is slowing the processing of the other GPUs down, waiting to receive the weights from the root GPU.\n",
|
454 | 454 | "\n",
|
455 | 455 | "<center width=\"100%\"><img style=\"margin:0 auto\" src=\"assets/example_broadcast.png\" /></center>\n",
|
456 | 456 | "\n",
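| + "A possible sketch of the asynchronous broadcast, with the weight shape chosen arbitrarily for illustration:\n",
| + "\n",
| + "```python\n",
| + "import torch\n",
| + "import torch.distributed as dist\n",
| + "\n",
| + "rank = dist.get_rank()\n",
| + "weights = torch.empty(10, 10, device=f'cuda:{rank}')\n",
| + "if rank == 0:\n",
| + "    torch.nn.init.normal_(weights)  # only rank 0 initializes the weights\n",
| + "\n",
| + "# async_op=True returns a handle immediately, so rank 0 can start loading images\n",
| + "work = dist.broadcast(weights, src=0, async_op=True)\n",
| + "# ... load the images from disk here ...\n",
| + "work.wait()  # make sure the weights have arrived before using them\n",
| + "```\n",
| + "\n",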
|
457 |
| - "Each GPU will then load the images from disk, perform a product to find the activations of the next layer and caculate a softmax to get class-belonging probabilities. \n", |
| 457 | + "Each GPU will then load the images from disk, perform a product to find the activations of the next layer and calculate a softmax to get class-belonging probabilities. \n", |
458 | 458 | "\n",
|
459 | 459 | "Finally, the loss is computed by summing over the dimensions and a **reduction** with sum is performed to compute the overall loss over the entire batch.\n",
|
460 | 460 | "\n",
|
|
576 | 576 | "\n",
|
577 | 577 | "We have seen how we can use these collectives to perform a calculation of the loss of a neural network, but the same can be extended to any type of parallelizable computation.\n",
|
578 | 578 | "\n",
|
579 |
| - "Finally, we saw how simple it is to set a PyTorch Lighnining training to use multiple GPUs.\n", |
| 579 | + "Finally, we saw how simple it is to set a PyTorch Lightning training to use multiple GPUs.\n", |
580 | 580 | "\n",
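| + "For reference, the change usually amounts to a couple of `Trainer` arguments; `MyModel` and `train_loader` below are placeholders, and the exact flag names can differ between Lightning versions:\n",
| + "\n",
| + "```python\n",
| + "import pytorch_lightning as pl\n",
| + "\n",
| + "model = MyModel()  # placeholder LightningModule\n",
| + "trainer = pl.Trainer(\n",
| + "    accelerator='gpu',\n",
| + "    devices=2,       # number of GPUs to use\n",
| + "    strategy='ddp',  # one process per GPU, gradients synchronized with all_reduce\n",
| + "    max_epochs=10,\n",
| + ")\n",
| + "trainer.fit(model, train_loader)  # placeholder DataLoader\n",
| + "```\n",
| + "\n",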
|
581 | 581 | "### References\n",
|
582 | 582 | "\n",
|
|