author: Team PyTorch
featured-img: ''
---
We are excited to announce the release of PyTorch 1.12 (release note)! This release is composed of over 3,124 commits from 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, and the FSDP API. We want to sincerely thank our dedicated community for your contributions.
Summary:
- Functional APIs to functionally apply module computation with a given set of parameters (see the sketch after this list)
- DataPipes from TorchData fully backward compatible with DataLoader
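As a rough illustration of the first item, here is a minimal sketch using the stateless functional API added in 1.12 (`torch.nn.utils.stateless.functional_call`); the module and the substituted parameter values are made up for demonstration:

```python
import torch
from torch.nn.utils.stateless import functional_call

# A plain module whose own parameters stay untouched.
model = torch.nn.Linear(3, 2)
x = torch.randn(1, 3)

# Run the module's forward pass with an alternative set of parameters.
new_params = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
out = functional_call(model, new_params, (x,))
print(out)  # zeros, because weight and bias were substituted with zero tensors
```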
### (Beta) Complex32 and Complex Convolutions in PyTorch
PyTorch today natively supports complex numbers, complex autograd, complex modules, and numerous complex operations, including linear algebra and Fast Fourier Transform (FFT) operators. Many libraries, including torchaudio and ESPNet, already make use of complex numbers in PyTorch, and PyTorch 1.12 further extends complex functionality with complex convolutions and the experimental complex32 (“complex half”) data type that enables half-precision FFT operations. Due to bugs in the CUDA 11.3 package, we recommend using the CUDA 11.6 wheels if you are using complex numbers.
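A quick sketch of the new functionality (shapes are illustrative; complex32 coverage is experimental and primarily targets CUDA):

```python
import torch
import torch.nn.functional as F

# Complex convolution: both the input and the weight are complex tensors.
x = torch.randn(1, 2, 16, dtype=torch.complex64)
w = torch.randn(4, 2, 3, dtype=torch.complex64)
y = F.conv1d(x, w)
print(y.dtype, y.shape)  # torch.complex64, torch.Size([1, 4, 14])

# Experimental complex32 ("complex half") enables half-precision FFTs on CUDA.
if torch.cuda.is_available():
    z = torch.randn(1024, dtype=torch.complex64, device="cuda").to(torch.complex32)
    spec = torch.fft.fft(z)
```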
DataPipes based on [AWSSDK](https://github.com/aws/aws-sdk-cpp) have been integrated into TorchData. They provide the following features, backed by the native AWSSDK (see the sketch after this list):
- Retrieve a list of URLs from each S3 bucket based on a prefix
- Support timeouts to prevent hanging indefinitely
- Support for specifying an S3 bucket region
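A minimal sketch of how these DataPipes compose; the bucket, prefix, and region below are placeholders, and the exact keyword arguments should be checked against the TorchData `S3FileLister`/`S3FileLoader` documentation:

```python
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader

# Placeholder bucket/prefix: list matching object URLs, then stream their bytes.
prefixes = IterableWrapper(["s3://my-bucket/train/"])
urls = S3FileLister(prefixes, region="us-west-2")
files = S3FileLoader(urls, region="us-west-2")

for url, stream in files:
    payload = stream.read()  # raw bytes of one S3 object
    break
```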
### (Beta) Accelerating PyTorch Vision Models with Channels Last on CPU
Memory formats have a significant impact on performance when running vision models; generally, Channels Last is more favorable from a performance perspective due to better data locality. PyTorch 1.12 covers the fundamental concepts of memory formats and demonstrates the performance benefits of Channels Last on popular PyTorch vision models on Intel® Xeon® Scalable processors (a usage sketch follows the list below).
- Enables Channels Last memory format support for the commonly used operators in the CV domain on CPU, applicable to both inference and training
- Provides native-level optimization of Channels Last kernels in ATen, applicable to both AVX2 and AVX512
- Delivers 1.3x to 1.8x inference performance gain over Channels First for TorchVision models on Intel® Xeon® Ice Lake (or newer) CPUs
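A short sketch of opting into Channels Last for CPU inference (torchvision is assumed to be installed; any 4D convolutional model is converted the same way):

```python
import torch
import torchvision

model = torchvision.models.resnet50().eval()
x = torch.randn(1, 3, 224, 224)

# Convert both the model's weights and the input to the Channels Last layout.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)  # dispatches to the Channels Last CPU kernels where available
```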
### (Beta) Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16
Reduced-precision numeric formats like bfloat16 improve PyTorch performance across multiple deep learning training workloads. PyTorch 1.12 includes the latest software enhancements for bfloat16, which apply to a broader scope of user scenarios and showcase even higher performance gains (see the autocast sketch after the list below). The main improvements include:
- 2x hardware compute throughput vs. float32 with the new bfloat16 native instruction VDPBF16PS, introduced on Intel® Xeon® Cooper Lake CPUs
- 1/2 the memory footprint of float32, and faster speed for memory-bandwidth-intensive operators
- 1.4x to 2.2x inference performance gain over float32 for TorchVision models on Intel® Xeon® Cooper Lake (or newer) CPUs
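A minimal sketch of running an operator under CPU autocast with bfloat16 (the tensors are toy data; on CPUs with the VDPBF16PS instruction the convolution maps to native bf16 compute):

```python
import torch

x = torch.randn(8, 3, 64, 64)
w = torch.randn(16, 3, 3, 3)

# CPU autocast runs eligible ops, such as convolution, in bfloat16.
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = torch.nn.functional.conv2d(x, w)

print(y.dtype)  # torch.bfloat16
```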
### (Prototype) Introducing Accelerated PyTorch Training on Mac
Accelerated GPU training and evaluation speedups over CPU-only (times faster)
Alongside the new MPS device support, the M1 binaries for Core and Domain libraries that have been available for the last few releases are now an official prototype feature. These binaries can be used to run PyTorch natively on Apple Silicon.
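A minimal sketch of targeting the new "mps" device, falling back to CPU when the MPS backend is not available:

```python
import torch

# Select the Metal Performance Shaders (MPS) backend on supported Macs.
device = "mps" if torch.backends.mps.is_available() else "cpu"

x = torch.randn(1024, 1024, device=device)
y = x @ x  # the matmul runs on the Apple GPU when device == "mps"
print(y.device)
```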
### (Prototype) BetterTransformer: Fastpath execution for Transformer Encoder Inference
PyTorch now supports CPU and GPU fastpath implementations (“BetterTransformer”) for Transformer Encoder inference.
[FSDP API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) helps easily scale large model training by sharding a model’s parameters, gradients and optimizer states across data parallel workers while maintaining the simplicity of data parallelism. The prototype version was released in PyTorch 1.11 with a minimum set of features that helped [scaling tests of models with up to 1T parameters](https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff).
In this beta release, the FSDP API adds the following features to support various production workloads. Highlights of the newly added features include (a configuration sketch follows the list):
1. Universal sharding strategy API - Users can easily change between sharding strategies with a single line change, and thus compare and use DDP (only data sharding), FSDP (full model and data sharding), or Zero2 (only sharding of optimizer and gradients) to optimize memory and performance for their specific training needs
2. Fine-grained mixed precision policies - Users can specify a mix of half and full data types (bfloat16, fp16 or fp32) for model parameters, gradient communication, and buffers via mixed precision policies. Models are automatically saved in fp32 to allow for maximum portability
3. Transformer auto wrapping policy - allows for optimal wrapping of Transformer-based models by registering the model's layer class, and thus accelerated training performance
4. Faster model initialization using device_id init - initialization is performed in a streaming fashion to avoid OOM issues and optimize init performance vs. CPU init
5. Rank0 streaming for full model saving of larger models - Fully sharded models can be saved by all GPUs streaming their shards to the rank 0 GPU, and the model is built in full state on the rank 0 CPU for saving
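A minimal configuration sketch touching several of the features above; it assumes the process group has already been initialized (for example via torchrun) and that each rank has a GPU:

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# Wrap a toy model; swap the ShardingStrategy to compare DDP-like, Zero2, or full sharding.
model = torch.nn.Linear(4096, 4096)
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,               # or SHARD_GRAD_OP / NO_SHARD
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),  # fine-grained dtype policy
    device_id=torch.cuda.current_device(),                       # streaming init on the target GPU
)
```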
For more details and example code, please check out the [documentation](https://pytorch.org/docs/1.11/fsdp.html?highlight=fsdp#module-torch.distributed.fsdp) and the [tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).