
Commit 714c7d0

cjyabraham committed
updated quote
Signed-off-by: cjyabraham <cjyabraham@gmail.com>
1 parent ee819b1

1 file changed: +1, -3 lines changed


_posts/2023-07-31-performant-distributed-checkpointing.md

Lines changed: 1 addition & 3 deletions
@@ -45,9 +45,7 @@ With this option as the new default, DCP now creates a single file per rank duri

By combining sharded_state_dict support with the single file per rank writer, distributed checkpoint was able to accelerate checkpoint saving time over 72x vs. the original PyTorch 1.13 save speed, and enable rapid checkpointing for model sizes over 15B which would previously simply time out.

-_"Looking back, it’s really astounding the speedups we’ve seen, handling training for many of these models. We went from taking almost half an hour to write a single 11B checkpoint in PyTorch 1.13, to being able to handle a 30B parameter model, with optimizer and dataloader state - so that’s over eight times the raw data - in just over 3 minutes. That’s done wonders for both the stability and efficiency of our jobs, as we scale up training to hundreds of gpus."
-
-**Davis Wertheimer, IBM Research**_
+_"Looking back, it’s really astounding the speedups we’ve seen, handling training for many of these models. We went from taking almost half an hour to write a single 11B checkpoint in PyTorch 1.13, to being able to handle a 30B parameter model, with optimizer and dataloader state - so that’s over eight times the raw data - in just over 3 minutes. That’s done wonders for both the stability and efficiency of our jobs, as we scale up training to hundreds of gpus." – **Davis Wertheimer, IBM Research**_

IBM’s adoption has also helped us validate and improve our solutions in a real-world, large-scale training environment. As an example, IBM discovered that DCP was working well for them on a single node with multiple GPUs, but errored out when used on multiple nodes.

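For context on the diffed paragraph above: the speedup comes from saving an FSDP sharded_state_dict through DCP, whose file-per-rank writer then emits one checkpoint file per rank. Below is a minimal sketch of that pattern, assuming an initialized process group and an FSDP-wrapped model; the function name, model handling, and checkpoint path are illustrative rather than taken from the post.

```python
# Illustrative sketch of combining FSDP's SHARDED_STATE_DICT with DCP's
# file-per-rank writer; not the post's code, and setup details are assumed.
import torch.distributed.checkpoint as dcp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType


def save_sharded_checkpoint(model: FSDP, checkpoint_dir: str = "checkpoint/") -> None:
    # Each rank produces only its local shards, avoiding a full state dict
    # gathered on rank 0.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}

    # FileSystemWriter with single_file_per_rank=True (the default described
    # in the post) writes one checkpoint file per rank.
    dcp.save_state_dict(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_dir, single_file_per_rank=True),
    )
```

The key point is that no rank ever gathers the full state dict; each rank writes its own shards in parallel, which is what enables the save-time reductions quoted in the post.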