Commit 2c66f4d

fixed more markdown
1 parent 9b6c850 commit 2c66f4d

_posts/2023-07-25-announcing-cpp.md: 20 additions & 20 deletions

Author: John He, Khaled ElGalaind, Roshani Nagmote, Daiming Yang

Training large deep learning models requires large datasets. [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) is a scalable cloud object storage service used for storing large training datasets. Machine learning (ML) practitioners need an efficient data pipeline that can download data from Amazon S3, transform the data, and feed the data to GPUs for training models with high throughput and low latency.

In this post, we introduce the new S3 IO DataPipes for PyTorch, [`S3FileLister`](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/s3io.py#L19) and [`S3FileLoader`](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/s3io.py#L106). For memory efficiency and fast runs, the new DataPipes use the C++ extension to access Amazon S3. Benchmarking shows that `S3FileLoader` is 59.8% faster than [`FSSpecFileOpener`](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/fsspec.py#L125) for downloading a natural language processing (NLP) dataset from Amazon S3. You can build [IterDataPipe](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) training pipelines with the new DataPipes. We also demonstrate that the new DataPipe can reduce overall BERT and ResNet50 training time by 7%. The new DataPipes have been upstreamed to the open-source [TorchData 0.4.0](https://github.com/pytorch/data/releases/tag/v0.4.0) with [PyTorch 1.12.0](https://github.com/pytorch/pytorch/releases/tag/v1.12.0).

## Overview

Amazon S3 is a scalable cloud storage service with no limit on data volume. Loading data from Amazon S3 and feeding the data to high-performance GPUs such as NVIDIA A100 can be challenging. It requires an efficient data pipeline that can meet the data processing speed of the GPUs. To help with this, we released a new high-performance tool for PyTorch: S3 IO DataPipes. DataPipes are subclassed from `torchdata.datapipes.iter.IterDataPipe`, so they can interact with the `IterableDataPipe` interface. Developers can quickly build their DataPipe DAGs to access, transform, and manipulate data with shuffle, sharding, and batch features, as sketched below.
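The following is a minimal sketch of such a DAG, assuming `torchdata` is built with the S3 extension; the bucket and prefix are hypothetical:

```python
from torchdata.datapipes.iter import IterableWrapper

# Compose a DataPipe graph: list S3 objects, shuffle the URLs, shard
# them across DataLoader workers, download the blobs, and batch them.
dp = (
    IterableWrapper(["s3://my-bucket/train/"])  # hypothetical prefix
    .list_files_by_s3()
    .shuffle()
    .sharding_filter()
    .load_files_by_s3()
    .batch(32)
)
```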
The new DataPipes are designed to be file format agnostic; Amazon S3 data is downloaded as binary large objects (BLOBs). They can be used as composable building blocks to assemble a DataPipe graph that loads tabular, NLP, and computer vision (CV) data into your training pipelines.

Under the hood, the new S3 IO DataPipes employ a C++ S3 handler built on the AWS C++ SDK. In general, a C++ implementation is more memory efficient and has better CPU core usage (no Global Interpreter Lock) in threading compared to Python. The new C++ S3 IO DataPipes are recommended for high throughput, low latency data loading when training large deep learning models.

The new S3 IO DataPipes provide two first-class citizen APIs:
* **S3FileLister** – Iterable that lists S3 file URLs within the given S3 prefixes. The functional name for this API is `list_files_by_s3`.
* **S3FileLoader** – Iterable that loads S3 files from the given S3 prefixes. The functional name for this API is `load_files_by_s3`. A construction sketch follows this list.
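A minimal construction sketch, again with a hypothetical bucket and assuming a `torchdata` build that includes the S3 extension:

```python
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader

# S3FileLister expands each prefix into object URLs;
# S3FileLoader consumes those URLs and yields (url, stream) pairs.
prefixes = IterableWrapper(["s3://my-bucket/train/"])
file_urls = S3FileLister(prefixes)
files = S3FileLoader(file_urls)
```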
## Usage

In this section, we provide instructions for using the new S3 IO DataPipes. We also provide a code snippet for `load_files_by_s3()`.
### Build from source

The new S3 IO DataPipes use the C++ extension. It is built into the `torchdata` package by default. However, if the new DataPipes are not available within your environment, for example Windows on Conda, you need to build from source. For more information, refer to [Iterable Datapipes](https://github.com/pytorch/data/tree/main/torchdata/datapipes/iter/load#s3-io-datapipe-documentation).
### Configuration

Amazon S3 supports global buckets. However, a bucket is created within a Region. You can pass a Region to the DataPipes by using `__init__()`. Alternatively, you can either run `export AWS_REGION=us-west-2` in your shell or set an environment variable in your code with `os.environ['AWS_REGION'] = 'us-east-1'`, as sketched below.
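A minimal sketch of both options; the bucket and Region values are hypothetical, and passing `region` through the functional calls is an assumption based on the DataPipes' constructor parameters:

```python
import os

from torchdata.datapipes.iter import IterableWrapper

# Option 1: set the Region through an environment variable.
os.environ["AWS_REGION"] = "us-west-2"

# Option 2: pass the Region explicitly to each DataPipe.
urls = IterableWrapper(["s3://my-bucket/train/"]).list_files_by_s3(region="us-west-2")
files = urls.load_files_by_s3(region="us-west-2")
```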
To read objects in a bucket that aren’t publicly accessible, you must provide AWS credentials through one of the following methods:

* [Install and configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) the [AWS Command Line Interface](https://aws.amazon.com/cli/) (AWS CLI) and run `aws configure`
* Set credentials in the AWS credentials profile file on the local system, located at `~/.aws/credentials` on Linux, macOS, or Unix
* Set the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables (see the sketch after this list)
* If you’re using this library on an [Amazon Elastic Compute Cloud](https://aws.amazon.com/ec2/) (Amazon EC2) instance, specify an [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) role and then give the EC2 instance access to that role
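For quick experiments, the environment-variable route can also be set from Python before the DataPipes are constructed; the values below are placeholders, and hard-coding real credentials is not recommended:

```python
import os

# Placeholder values; prefer an AWS CLI profile or an IAM role in practice.
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
```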
### Example code

The following code snippet provides a typical usage of `load_files_by_s3()`:
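The following is a minimal sketch of that usage, assuming the standard `IterableWrapper` entry point and a hypothetical bucket:

```python
from torchdata.datapipes.iter import IterableWrapper

# List the object URLs under a prefix, then stream each object.
s3_shard_urls = IterableWrapper(["s3://my-bucket/train/"]).list_files_by_s3()
s3_shards = s3_shard_urls.load_files_by_s3()

# Each element is a (url, stream) pair; read the raw bytes of one shard.
for url, stream in s3_shards:
    payload = stream.read()
```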
In this section, we demonstrate how the new DataPipe can reduce overall BERT and ResNet50 training time.

### Isolated DataLoader performance evaluation against FSSpec
`FSSpecFileOpener` is another PyTorch S3 DataPipe. It uses `botocore` and `aiohttp/asyncio` to access S3 data. The following is the performance test setup and result (quoted from [Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes](https://github.com/pytorch/data/issues/500)).
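For reference, an equivalent `fsspec`-based pipeline looks roughly like the following sketch, assuming `s3fs` is installed and using the same hypothetical bucket:

```python
from torchdata.datapipes.iter import IterableWrapper

# FSSpecFileLister and FSSpecFileOpener, via their functional names.
urls = IterableWrapper(["s3://my-bucket/train/"]).list_files_by_fsspec()
files = urls.open_files_by_fsspec(mode="rb")
```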
The S3 data in the test is a sharded text dataset. Each shard has about 100,000 lines, and each line is around 1.6 KB, making each shard about 156 MB. The measurements in this benchmark are averaged over 1,000 batches. No shuffling, sampling, or transforms were performed.

The following chart reports the throughput comparison for various batch sizes with `num_workers=0`, where the data loader runs in the main process. `S3FileLoader` has higher queries per second (QPS): 90% higher than `fsspec` at batch size 512.

![Batch Sizes 1](/assets/images/2023-7-25-announcing-ccp-based-s3-io-datapipes-1.png){:style="max-width:620px; width:100%; display: block; margin-left: auto; margin-right: auto"}
The following chart reports the results for `num_workers=4`, where data loading runs in four worker processes. `S3FileLoader` achieves 59.8% higher QPS than `fsspec` at batch size 512.
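A sketch of how the worker count is applied, reusing the `s3_shard_urls` pipe from the earlier example; a PyTorch `DataLoader` accepts an `IterDataPipe` directly:

```python
from torch.utils.data import DataLoader

# sharding_filter gives each of the four worker processes a distinct
# slice of the URLs; a pass-through collate keeps (url, stream) pairs.
dp = s3_shard_urls.sharding_filter().load_files_by_s3()
loader = DataLoader(dp, batch_size=512, num_workers=4, collate_fn=list)
```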
![Batch Sizes 2](/assets/images/2023-7-25-announcing-ccp-based-s3-io-datapipes-5.png){:style="max-width:620px; width:100%; display: block; margin-left: auto; margin-right: auto"}
### Training a ResNet50 model against Boto3
For the following chart, we trained a ResNet50 model on a cluster of 4 p3.16xlarge instances with a total of 32 GPUs. The training dataset is ImageNet, with 1.2 million images organized into 1,000-image shards. The training batch size is 64. The training time is measured in seconds. For eight epochs, `S3FileLoader` is 7.5% faster than Boto3.

![Boto3](/assets/images/2023-7-25-announcing-ccp-based-s3-io-datapipes-2.png){:style="max-width:620px; width:100%; display: block; margin-left: auto; margin-right: auto"}
### Training a BERT model against Boto3
For the following chart, we trained a BERT model on a cluster of 4 p3.16xlarge instances with a total of 32 GPUs. The training corpus has 1,474 files. Each file has around 150,000 samples. To run a shorter epoch, we use 0.05% (approximately 75 samples) per file. The batch size is 2,048. The training time is measured in seconds. For one epoch, `S3FileLoader` is 7% faster than Boto3.

![Boto3 2](/assets/images/2023-7-25-announcing-ccp-based-s3-io-datapipes-3.png){:style="max-width:620px; width:100%; display: block; margin-left: auto; margin-right: auto"}
### Comparison against the original PyTorch S3 plugin
The new PyTorch S3 DataPipes perform substantially better than the original [PyTorch S3 plugin](https://github.com/aws/amazon-s3-plugin-for-pytorch). We have tuned the internal buffer size for `S3FileLoader`. The loading time is measured in seconds.

For the 10 sharded Charades files (approximately 1.5 GiB each), `S3FileLoader` was 3.5 times faster in our experiments.
### Best practices

Training large deep learning models may require a massive compute cluster with tens or even hundreds of nodes. Each node in the cluster may generate a large number of data loading requests that hit a specific S3 shard. To avoid throttling, we recommend sharding training data across S3 buckets and S3 folders, as sketched below.
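A minimal sketch of that layout (all bucket and folder names hypothetical), so requests fan out across prefixes instead of converging on one shard:

```python
from torchdata.datapipes.iter import IterableWrapper

# Spread training shards across multiple buckets and folders; the
# lister expands each prefix into its object URLs.
prefixes = IterableWrapper([
    "s3://training-data-a/shards-00/",
    "s3://training-data-b/shards-01/",
])
files = prefixes.list_files_by_s3().shuffle().load_files_by_s3()
```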
## Conclusion and next steps

In this post, we introduced you to the new PyTorch S3 IO DataPipes. The new DataPipes use `aws-sdk-cpp` and show better performance than Boto3-based data loaders.
For next steps, we plan to improve on usability, performance, and functionality by focusing on the following features:
* **Local caching** – We plan to support model training that traverses the training dataset for multiple passes. Local caching after the first epoch can cut out time-of-flight delays from Amazon S3, which can substantially accelerate data retrieval time for subsequent epochs.
* **Customizable configuration** – We plan to expose more parameters, such as internal buffer size, multi-part chunk size, and executor count, and allow users to further tune data loading efficiency.
* **Amazon S3 upload** – We plan to expand the S3 DataPipes to support upload for checkpointing.
* **Merge with fsspec** – `fsspec` is used in other systems such as `torch.save()`. We can integrate the new S3 DataPipes with `fsspec` so they can have more use cases.