@@ -31,7 +31,7 @@ won't be possible on a single GPU.
🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:

- 1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type
+ 1. Integration of the core DeepSpeed features via [`Trainer`]. This is an everything-done-for-you type
of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
this document is focused on this feature.
2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
@@ -97,7 +97,7 @@ TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--disable-pip-version-check 2>&1 | tee build.log
```

- If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
+ If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also
install *libaio-dev* system-wide).
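
For reference, a hedged sketch of the same prebuild command with the NVMe async-I/O op enabled might look like this (it assumes you are inside the DeepSpeed source checkout, have already installed *libaio-dev*, and the exact set of pip flags may differ from the full command shown in the docs):

```bash
# Sketch only: the prebuild command from above plus DS_BUILD_AIO=1 for NVMe offload.
# Adjust TORCH_CUDA_ARCH_LIST to match your GPUs (see below).
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 \
  pip install . --disable-pip-version-check 2>&1 | tee build.log
```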

Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
@@ -134,7 +134,7 @@ You can check the archs pytorch was built with using:
python -c "import torch; print(torch.cuda.get_arch_list())"
```

- Here is how to find out the arch for one of the installed GPU. For example, for GPU 0:
+ Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:

```bash
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
@@ -169,7 +169,7 @@ following:
2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.

- Therefore, if your original command line looked as following:
+ Therefore, if your original command line looked as follows:

```bash
python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
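
# (Hedged sketch, not the literal continuation of these docs.) With DeepSpeed the
# equivalent launch goes through the `deepspeed` launcher and points --deepspeed
# at your config file, roughly:
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json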
@@ -214,7 +214,7 @@ For some practical usage examples, please, see this [post](https://github.com/hu

### Deployment with one GPU

- To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:
+ To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows:

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
@@ -560,7 +560,7 @@ Do note that some values, such as `scheduler.params.total_num_steps` are calcula

### ZeRO

[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
- support 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
+ supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
You will find more in-depth information in the DeepSpeed documentation.
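
The stage itself is selected in the `zero_optimization` section of the configuration file. A minimal, purely illustrative sketch (the file name and the bare-bones shape are assumptions, not taken from the docs):

```bash
# Illustrative only: write a bare-bones config that just picks the ZeRO stage
# (use "stage": 3 for full parameter partitioning and the ZeRO-Infinity features).
cat > ds_config.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2
  }
}
EOF
```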
@@ -581,7 +581,7 @@ going to use.

#### ZeRO-2 Config

- The following is an example configuration for ZeRO stage 2:
+ The following is an example of a configuration for ZeRO stage 2:

```json
{
@@ -604,13 +604,13 @@ The following is an example configuration for ZeRO stage 2:

**Performance tuning:**

- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
- - `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
+ - `"overlap_comm": true` trades off increased GPU RAM usage for lower all-reduce latency. `overlap_comm` uses 4.5x
the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
the same on larger capacity GPUs as well, if you're starting to hit OOM.
- - when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size,
- the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
+ - when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is,
+ the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
important, getting a slightly slower training time could be a good trade (see the combined sketch after this list).
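
Putting those knobs together, here is a hedged sketch of what a small-GPU-friendly `zero_optimization` block might look like; the file name, the nested `"device": "cpu"` form of `offload_optimizer`, and the exact values are illustrative assumptions, not taken from the example above:

```bash
# Illustrative sketch for a GPU with ~8GB or less: offload the optimizer, keep
# overlap_comm, and shrink both buckets from 5e8 to 2e8 (~3.6GB comm footprint).
cat > ds_config_small_gpu.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "overlap_comm": true,
    "allgather_bucket_size": 2e8,
    "reduce_bucket_size": 2e8
  }
}
EOF
```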
@@ -619,7 +619,7 @@ The following is an example configuration for ZeRO stage 2:

#### ZeRO-3 Config

- The following is an example configuration for ZeRO stage 3:
+ The following is an example of a configuration for ZeRO stage 3:

```json
{
@@ -662,7 +662,7 @@ and its typically accessed much faster than normal CPU memory.

If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
- `stage3_max_live_parameters` and `stage3_max_reuse_distance`, so its not additive, its just 2GB total.
+ `stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
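
For example, a hedged sketch of lowering both knobs in a ZeRO-3 config when you are memory-constrained (the file name and the exact values are illustrative assumptions):

```bash
# Illustrative sketch: halve the ~2GB shared budget implied by 1e9 if you hit OOM.
cat > ds_config_zero3_lowmem.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 5e8,
    "stage3_max_reuse_distance": 5e8
  }
}
EOF
```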

`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
time. "reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we