Author: Suraj Subramanian
Follow along with the video below or on YouTube.
This series of video tutorials walks you through distributed training in PyTorch using DDP (DistributedDataParallel).
The series starts with a simple non-distributed training job, and ends with deploying a training job across several machines in a cluster. Along the way, you will also learn about torchrun for fault-tolerant distributed training.
The tutorial assumes a basic familiarity with model training in PyTorch.
You will need multiple CUDA GPUs to run the tutorial code. A cloud instance with multiple GPUs works well (the tutorials use an Amazon EC2 P3 instance with 4 GPUs).
The tutorial code is hosted in this GitHub repo. Clone the repository and follow along!
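Before starting, you can verify that your environment meets the multi-GPU requirement. This is a minimal sketch, not part of the tutorial code itself; it only queries PyTorch for the GPUs visible to the current process:

```python
import torch

# Report whether CUDA is usable and how many GPUs this process can see.
gpu_count = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}, GPUs visible: {gpu_count}")

if gpu_count < 2:
    # The single-node multi-GPU and multi-node sections assume at least 2 GPUs.
    print("Fewer than 2 GPUs found; the multi-GPU examples will not run as-is.")
```

Note that `torch.cuda.device_count()` respects the `CUDA_VISIBLE_DEVICES` environment variable, so it reflects what your training job will actually see.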
- Introduction (this page)
- What is DDP? A gentle introduction to what DDP is doing under the hood
- Single-Node Multi-GPU Training: training models using multiple GPUs on a single machine
- Fault-tolerant distributed training: making your distributed training job robust with torchrun
- Multi-Node training: training models using multiple GPUs on multiple machines
- Training a GPT model with DDP: a "real-world" example of training a minGPT model with DDP
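The core pattern the series builds on can be previewed in a minimal sketch: initialize a process group, wrap the model in DDP so gradients are synchronized across ranks, run a training step, and tear down. To keep the sketch runnable anywhere, it uses the `gloo` backend with a single CPU process; the tutorials themselves launch one process per GPU with the `nccl` backend, and the address/port values here are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group on CPU via "gloo" so this runs without GPUs.
# Real jobs use backend="nccl" with one process per GPU (see the tutorials).
os.environ.setdefault("MASTER_ADDR", "localhost")  # placeholder rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder rendezvous port
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10)
ddp_model = DDP(model)  # DDP all-reduces gradients across ranks on backward()

out = ddp_model(torch.randn(2, 10))  # forward pass runs the wrapped module
out.sum().backward()                 # backward() triggers gradient synchronization

dist.destroy_process_group()
```

With more ranks, each process would run this same script and DDP would average gradients across them after every backward pass, which is exactly the mechanism the "What is DDP?" video walks through.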