Commit c327c9f

Update 2023-08-23-large-scale-training-hugging-face.md
1 parent c7ede36 commit c327c9f

File tree

1 file changed: +2 -4 lines changed

_posts/2023-08-23-large-scale-training-hugging-face.md

Lines changed: 2 additions & 4 deletions
@@ -116,7 +116,7 @@ Unset
 }
 ```

-5. Now, it’s time to train your model! First, ensure that you have your PyTorch/XLA runtime set up appropriately by setting|
+Now, it’s time to train your model! First, ensure that you have your PyTorch/XLA runtime set up appropriately by setting|

 ```
 Unset
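
The runtime setting referred to in the changed line is not shown in this hunk (it lives in the code block that follows in the post), so it is left as-is here. Purely as a hedged illustration of the surrounding step, the sketch below checks that a PyTorch/XLA runtime is set up and its devices are reachable before launching training; it is a generic check, not the post's configuration.

```python
# Generic pre-training sanity check for a PyTorch/XLA setup: confirm the
# runtime is configured and the XLA devices are visible. This is illustrative
# only and is NOT the specific runtime setting the post refers to.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # first XLA device owned by this process
print("XLA device:", device)
print("Participating devices:", xm.xrt_world_size())

# Run a tiny op on the device; materializing the result on the host forces
# compilation and execution, so runtime misconfiguration surfaces immediately.
x = torch.ones((2, 2), device=device) * 3.0
print(x.cpu())
```
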
@@ -211,9 +211,7 @@ Among these configurations, MFU peaks at 45.1% for the 20B parameter model on v4

 There are two actionable insights from these experiments:

-First, simply increasing the number of chips without increasing the batch size generally means lower FLOPS utilization, because more time is spent on sharing the model shards. FSDP uses all-reduce communication collectives which are not asynchronous, which means that
-
-chip-to-chip communication cannot be overlapped with computation. As the number of chips increases, the number of model shards that must be communicated increases, and so we should expect the portion of the step time spent on communication to increase with the number of chips.
+First, simply increasing the number of chips without increasing the batch size generally means lower FLOPS utilization, because more time is spent on sharing the model shards. FSDP uses all-reduce communication collectives which are not asynchronous, which means that chip-to-chip communication cannot be overlapped with computation. As the number of chips increases, the number of model shards that must be communicated increases, and so we should expect the portion of the step time spent on communication to increase with the number of chips.

 Second, increasing the batch size generally means better FLOPS utilization. As the number of chips increases, the memory footprint of the model decreases, which often frees up high bandwidth memory (HBM) to scale up the global batch size. With a larger global batch size, the number of tokens processed in each step increases, and thus, so does the FLOPS per step. As long as the step time does not increase proportionally, we expect a larger global batch size to improve MFU.
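
For context on the two insights above, here is a minimal sketch of how model FLOPS utilization (MFU) is commonly estimated for transformer training, using the standard 6 x parameters x tokens approximation for training FLOPs. The parameter count, token counts, step times, and per-chip peak-FLOPS figure below are hypothetical illustration values, not measurements from the post.

```python
# Minimal MFU estimate for transformer training, using the common
# approximation of ~6 FLOPs per parameter per training token.
# All concrete numbers below are hypothetical illustration values.

def training_flops_per_step(num_params: float, tokens_per_step: float) -> float:
    """Approximate FLOPs for one training step (forward + backward)."""
    return 6.0 * num_params * tokens_per_step

def mfu(num_params: float, tokens_per_step: float, step_time_s: float,
        num_chips: int, peak_flops_per_chip: float) -> float:
    """Achieved FLOP/s divided by the aggregate peak FLOP/s of all chips."""
    achieved = training_flops_per_step(num_params, tokens_per_step) / step_time_s
    return achieved / (num_chips * peak_flops_per_chip)

if __name__ == "__main__":
    # Hypothetical 20B-parameter model on 256 chips with an assumed
    # 275 TFLOP/s bf16 peak per chip.
    common = dict(num_params=20e9, num_chips=256, peak_flops_per_chip=275e12)
    base = mfu(tokens_per_step=1.0e6, step_time_s=4.0, **common)
    doubled = mfu(tokens_per_step=2.0e6, step_time_s=6.5, **common)
    print(f"MFU at base global batch:    {base:.1%}")
    print(f"MFU at doubled global batch: {doubled:.1%}")
```

With the chip count held fixed, the numerator scales with tokens per step, so MFU improves as long as the step time grows sub-linearly with the global batch size, which is exactly the second insight above.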