
Commit 5d8b986

Fix deepspeed docs (#15346)
1 parent 96161ac commit 5d8b986

File tree

1 file changed: +12, -12 lines


docs/source/main_classes/deepspeed.mdx

@@ -31,7 +31,7 @@ won't be possible on a single GPU.
 
 🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options:
 
-1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for you type
+1. Integration of the core DeepSpeed features via [`Trainer`]. This is everything done for your type
 of integration - just supply your custom config file or use our template and you have nothing else to do. Most of
 this document is focused on this feature.
 2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed
@@ -97,7 +97,7 @@ TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
 --disable-pip-version-check 2>&1 | tee build.log
 ```
 
-If you intend to use NVMe offload you will need to also include `DS_BUILD_AIO=1` in the instructions above (and also
+If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also
 install *libaio-dev* system-wide).
 
 Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all
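For reference, a complete prebuild line with NVMe offload enabled might look like the sketch below. It only combines the flags visible in this hunk with the `DS_BUILD_AIO=1` the text calls for (the elided middle of the original command is not reproduced), and it assumes a Debian/Ubuntu system for the *libaio-dev* install:

```bash
# async-IO system library required by DeepSpeed's NVMe offload (Debian/Ubuntu)
sudo apt install libaio-dev

# same prebuild as above, with the AIO op added via DS_BUILD_AIO=1
DS_BUILD_AIO=1 TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
pip install . --disable-pip-version-check 2>&1 | tee build.log
```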
@@ -134,7 +134,7 @@ You can check the archs pytorch was built with using:
 python -c "import torch; print(torch.cuda.get_arch_list())"
 ```
 
-Here is how to find out the arch for one of the installed GPU. For example, for GPU 0:
+Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
 
 ```bash
 CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
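The query above is cut off at the line continuation. One way to finish it is with `torch.cuda.get_device_capability()`, which returns the compute capability as a `(major, minor)` tuple; a minimal sketch:

```bash
# print the compute capability of GPU 0, e.g. (8, 6)
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
# (8, 6) means you would put "8.6" into TORCH_CUDA_ARCH_LIST
```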
@@ -169,7 +169,7 @@ following:
 2. add a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as
 documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you.
 
-Therefore, if your original command line looked as following:
+Therefore, if your original command line looked as follows:
 
 ```bash
 python -m torch.distributed.launch --nproc_per_node=2 your_program.py <normal cl args>
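Putting the two steps together, the DeepSpeed-launched equivalent of that command would look roughly like this (the `deepspeed` launcher replaces `torch.distributed.launch`, and `--deepspeed ds_config.json` is the new argument from step 2):

```bash
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
```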
@@ -214,7 +214,7 @@ For some practical usage examples, please, see this [post](https://github.com/hu
 
 ### Deployment with one GPU
 
-To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as following:
+To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows:
 
 ```bash
 deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
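The invocation is truncated at the line continuation. What typically makes a single-GPU deployment worthwhile is pairing it with ZeRO offload; a minimal `ds_config.json` sketch along those lines, built around the `offload_optimizer` key discussed later in this diff (`pin_memory` is an optional extra, and all values are illustrative):

```json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    }
}
```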
@@ -560,7 +560,7 @@ Do note that some values, such as `scheduler.params.total_num_steps` are calcula
 ### ZeRO
 
 [Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It
-support 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
+supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes,
 therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity.
 You will find more indepth information in the DeepSpeed documentation.
 
@@ -581,7 +581,7 @@ going to use.
 
 #### ZeRO-2 Config
 
-The following is an example configuration for ZeRO stage 2:
+The following is an example of configuration for ZeRO stage 2:
 
 ```json
 {
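The example itself is cut off at the opening brace. A minimal sketch of the `zero_optimization` section it introduces, restricted to the keys covered by the tuning notes in the next hunk (values illustrative):

```json
{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8
    }
}
```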
@@ -604,13 +604,13 @@ The following is an example configuration for ZeRO stage 2:
 **Performance tuning:**
 
 - enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`)
-- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
+- `"overlap_comm": true` trade offs increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x
 the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB
 footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
 OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do
 the same on larger capacity GPU as well, if you're starting to hit OOM.
-- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size,
-the slower the communication, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
+- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is,
+the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is
 important, getting a slightly slower training time could be a good trade.
 
 
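Spelling out the arithmetic in that hunk: `5e8 elements x 2 bytes (fp16) x 2 buffers x 4.5 = 9e9 bytes`, i.e. the quoted 9GB, while `2e8` gives `2e8 x 2 x 2 x 4.5 = 3.6e9 bytes`, the quoted 3.6GB. As a config fragment, the reduced-buffer variant for an 8GB card would be:

```json
{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    }
}
```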

@@ -619,7 +619,7 @@ The following is an example configuration for ZeRO stage 2:
 
 #### ZeRO-3 Config
 
-The following is an example configuration for ZeRO stage 3:
+The following is an example of configuration for ZeRO stage 3:
 
 ```json
 {
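As with the ZeRO-2 example, the JSON is cut off at the brace. A minimal stage-3 sketch using only the two parameters the last hunk of this diff discusses (values illustrative, matching the `1e9` the text uses):

```json
{
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    }
}
```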
@@ -662,7 +662,7 @@ and its typically accessed much faster than normal CPU memory.
 
 If hitting OOM reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact
 on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by
-`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so its not additive, its just 2GB total.
+`stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
 
 `stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given
 time. "reuse distance" is a metric we are using to figure out when will a parameter be used again in the future, and we
