
Commit 420037e

Fix run_demo(demo_model_parallel, world_size) issue (#2367)
In the function demo_model_parallel, dev0 and dev1 are computed so that each process is assigned two distinct GPUs: the rank is doubled, giving dev0 = rank * 2 and dev1 = rank * 2 + 1, and world_size is set to half the number of GPUs. Assuming 8 GPUs, world_size is set to 4, leading to the creation of 4 processes, each of which is allocated two distinct GPUs. For instance, the first process (process 0) is assigned GPUs 0 and 1, the second process (process 1) is assigned GPUs 2 and 3, and so forth.
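For illustration, a minimal standalone sketch (not part of the patch) of the resulting rank-to-GPU mapping with 8 GPUs:

# Sketch only: reproduces the device assignment described above for n_gpus = 8.
n_gpus = 8
world_size = n_gpus // 2          # 4 processes, each driving two GPUs
for rank in range(world_size):
    dev0 = rank * 2               # first GPU owned by this process
    dev1 = rank * 2 + 1           # second GPU owned by this process
    print(f"process {rank}: cuda:{dev0}, cuda:{dev1}")
# process 0: cuda:0, cuda:1
# process 1: cuda:2, cuda:3
# process 2: cuda:4, cuda:5
# process 3: cuda:6, cuda:7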
1 parent 83cbc8d commit 420037e

File tree: 1 file changed, +3 -2 lines changed

intermediate_source/ddp_tutorial.rst (+3, -2)
@@ -269,8 +269,8 @@ either the application or the model ``forward()`` method.
     setup(rank, world_size)

     # setup mp_model and devices for this process
-    dev0 = (rank * 2) % world_size
-    dev1 = (rank * 2 + 1) % world_size
+    dev0 = rank * 2
+    dev1 = rank * 2 + 1
     mp_model = ToyMpModel(dev0, dev1)
     ddp_mp_model = DDP(mp_model)

@@ -293,6 +293,7 @@ either the application or the model ``forward()`` method.
     world_size = n_gpus
     run_demo(demo_basic, world_size)
     run_demo(demo_checkpoint, world_size)
+    world_size = n_gpus//2
     run_demo(demo_model_parallel, world_size)

 Initialize DDP with torch.distributed.run/torchrun
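Below is a hedged, self-contained sketch of how the corrected demo_model_parallel fits together. The names setup, cleanup, ToyMpModel, and run_demo come from ddp_tutorial.rst, but their bodies here are minimal stand-ins written for illustration; the backend choice ("nccl") and the toy layer sizes are assumptions, not part of this commit.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Minimal stand-in for the tutorial's process-group setup (address/port are placeholders).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class ToyMpModel(nn.Module):
    # Toy two-layer model split across two GPUs (dev0 and dev1).
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))

def demo_model_parallel(rank, world_size):
    setup(rank, world_size)

    # setup mp_model and devices for this process: with the fix, process 0
    # gets GPUs 0 and 1, process 1 gets GPUs 2 and 3, and so on.
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    outputs = ddp_mp_model(torch.randn(20, 10))  # output lands on dev1
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, "demo_model_parallel needs at least 2 GPUs"
    world_size = n_gpus // 2  # each spawned process now owns two GPUs
    run_demo(demo_model_parallel, world_size)

With the fix, each spawned process binds to an exclusive pair of GPUs; previously, running with world_size = n_gpus made the modulus wrap around, so higher ranks were assigned the same devices as lower ranks.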
