This tutorial contains three demos on Distributed Training; the last one is the Model Parallel use case. In the last code block, under the if statement, the world_size argument of the call run_demo(demo_model_parallel, world_size) has to be world_size//2: in the Model Parallel demo, two exclusive GPUs are assigned to every process, so there must be half as many processes as GPUs.
In the demo, however, the last code block sets world_size = n_gpus. This assignment is correct for the calls run_demo(demo_basic, world_size) and run_demo(demo_checkpoint, world_size), but not for run_demo(demo_model_parallel, world_size).
I propose editing the if statement in the last block to be:
if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    world_size = n_gpus//2
    run_demo(demo_model_parallel, world_size)
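
To illustrate the arithmetic behind this change, here is a minimal sketch that needs no GPUs to run. The helpers `processes_for` and `devices_for_rank` are hypothetical names that are not part of the tutorial, and the rank-to-device mapping (each rank taking a contiguous pair of device indices) is my assumption about how the Model Parallel demo assigns its two GPUs.

```python
def processes_for(n_gpus: int, gpus_per_process: int) -> int:
    """World size when every process owns `gpus_per_process` exclusive GPUs."""
    if gpus_per_process < 1:
        raise ValueError("gpus_per_process must be at least 1")
    return n_gpus // gpus_per_process


def devices_for_rank(rank: int, gpus_per_process: int) -> list[int]:
    """Device indices a rank would use (assumed contiguous-block mapping)."""
    start = rank * gpus_per_process
    return list(range(start, start + gpus_per_process))


if __name__ == "__main__":
    n_gpus = 4  # illustrative value; the demo reads torch.cuda.device_count()

    # Proposed setting: half as many processes as GPUs.
    world_size = processes_for(n_gpus, gpus_per_process=2)
    for rank in range(world_size):
        print(f"rank {rank} -> devices {devices_for_rank(rank, 2)}")

    # Current setting: world_size = n_gpus, so the higher ranks would ask
    # for device indices that do not exist on the machine.
    for rank in range(n_gpus):
        devices = devices_for_rank(rank, 2)
        status = "ok" if max(devices) < n_gpus else "out of range"
        print(f"rank {rank} -> devices {devices} ({status})")
```

With 4 GPUs, only ranks 0 and 1 map to valid device pairs under this assumption, which is why the Model Parallel call needs n_gpus//2 processes.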
cc @mrshenli @osalpekar @H-Huang @kwen2501