diff --git a/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md b/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md
index 0a13979349de..01d07886b4b2 100644
--- a/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md
+++ b/_posts/2019-12-06-pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'PyTorch adds new tools and libraries, welcomes Preferred Networks to its community'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [four]
+preview: 'PyTorch continues to be used for the latest state-of-the-art research on display at the NeurIPS conference next week, making up nearly [70% of papers](https://chillee.github.io/pytorch-vs-tensorflow/) that cite a framework. In addition, we’re excited to welcome Preferred Networks, the maintainers of the Chainer framework, to the PyTorch community. Their teams are moving fully over to PyTorch for developing their ML capabilities and services.'
---
PyTorch continues to be used for the latest state-of-the-art research on display at the NeurIPS conference next week, making up nearly [70% of papers](https://chillee.github.io/pytorch-vs-tensorflow/) that cite a framework. In addition, we’re excited to welcome Preferred Networks, the maintainers of the Chainer framework, to the PyTorch community. Their teams are moving fully over to PyTorch for developing their ML capabilities and services.
diff --git a/_posts/2019-4-29-stochastic-weight-averaging-in-pytorch.md b/_posts/2019-4-29-stochastic-weight-averaging-in-pytorch.md
index a610776b0c2d..9c7acd9dc3fe 100644
--- a/_posts/2019-4-29-stochastic-weight-averaging-in-pytorch.md
+++ b/_posts/2019-4-29-stochastic-weight-averaging-in-pytorch.md
@@ -3,6 +3,9 @@ layout: blog_detail
title: 'Stochastic Weight Averaging in PyTorch'
author: Pavel Izmailov and Andrew Gordon Wilson
redirect_from: /2019/04/29/road-to-1.0.html
+image: /assets/images/bert2.png
+tags: [two]
+preview: 'In this blogpost we describe the recently proposed Stochastic Weight Averaging (SWA) technique [1, 2], and its new implementation in [`torchcontrib`](https://github.com/pytorch/contrib). SWA is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD) at no additional cost, and can be used as a drop-in replacement for any other optimizer in PyTorch.'
---
In this blogpost we describe the recently proposed Stochastic Weight Averaging (SWA) technique [1, 2], and its new implementation in [`torchcontrib`](https://github.com/pytorch/contrib). SWA is a simple procedure that improves generalization in deep learning over Stochastic Gradient Descent (SGD) at no additional cost, and can be used as a drop-in replacement for any other optimizer in PyTorch. SWA has a wide range of applications and features:
diff --git a/_posts/2019-5-1-optimizing-cuda-rnn-with-torchscript.md b/_posts/2019-5-1-optimizing-cuda-rnn-with-torchscript.md
index ce4d7e255a42..4b68a412daa6 100644
--- a/_posts/2019-5-1-optimizing-cuda-rnn-with-torchscript.md
+++ b/_posts/2019-5-1-optimizing-cuda-rnn-with-torchscript.md
@@ -1,13 +1,15 @@
---
layout: blog_detail
-title: "Optimizing CUDA Recurrent Neural Networks with TorchScript"
-author: "The PyTorch Team"
-date: 2019-05-01 8:00:00 -0500
+title: 'Optimizing CUDA Recurrent Neural Networks with TorchScript'
+author: The PyTorch Team
+image: /assets/images/bert2.png
+tags: [three]
+preview: 'This week, we officially released PyTorch 1.1, a large feature update to PyTorch 1.0. One of the new features we’ve added is better support for fast, custom Recurrent Neural Networks (fastrnns) with [TorchScript](https://pytorch.org/docs/stable/jit.html) (the PyTorch JIT).'
---
-This week, we officially released PyTorch 1.1, a large feature update to PyTorch 1.0. One of the new features we've added is better support for fast, custom Recurrent Neural Networks (fastrnns) with TorchScript (the PyTorch JIT) (https://pytorch.org/docs/stable/jit.html).
+This week, we officially released PyTorch 1.1, a large feature update to PyTorch 1.0. One of the new features we've added is better support for fast, custom Recurrent Neural Networks (fastrnns) with [TorchScript](https://pytorch.org/docs/stable/jit.html) (the PyTorch JIT).
-RNNs are popular models that have shown good performance on a variety of NLP tasks that come in different shapes and sizes. PyTorch implements a number of the most popular ones, the [Elman RNN](https://pytorch.org/docs/master/nn.html?highlight=rnn#torch.nn.RNN), [GRU](https://pytorch.org/docs/master/nn.html?highlight=gru#torch.nn.GRU), and [LSTM](https://pytorch.org/docs/master/nn.html?highlight=lstm#torch.nn.LSTM) as well as multi-layered and bidirectional variants.
+RNNs are popular models that have shown good performance on a variety of NLP tasks, and they come in different shapes and sizes. PyTorch implements a number of the most popular ones: the [Elman RNN](https://pytorch.org/docs/master/nn.html?highlight=rnn#torch.nn.RNN), [GRU](https://pytorch.org/docs/master/nn.html?highlight=gru#torch.nn.GRU), and [LSTM](https://pytorch.org/docs/master/nn.html?highlight=lstm#torch.nn.LSTM), as well as multi-layered and bidirectional variants.
However, many users want to implement their own custom RNNs, taking ideas from recent literature. Applying [Layer Normalization](https://arxiv.org/abs/1607.06450) to LSTMs is one such use case. Because the PyTorch CUDA LSTM implementation uses a fused kernel, it is difficult to insert normalizations or even modify the base LSTM implementation. Many users have turned to writing custom implementations using standard PyTorch operators, but such code suffers from high overhead: most PyTorch operations launch at least one kernel on the GPU and RNNs generally run many operations due to their recurrent nature. However, we can apply TorchScript to fuse operations and optimize our code automatically, launching fewer, more optimized kernels on the GPU.
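+As a concrete illustration of the kind of custom cell being discussed (a sketch, not the post's `custom_lstm.py`; all names and shapes below are made up), here is a hand-written, layer-normalized LSTM step built only from standard PyTorch operators:
+
+```python
+import torch
+
+def layernorm_lstm_step(x, hx, cx, w_ih, w_hh, ln):
+    # compute the gate pre-activations from standard ops and layer-normalize them
+    gates = ln(x @ w_ih.t() + hx @ w_hh.t())
+    i, f, g, o = gates.chunk(4, dim=1)
+    cy = torch.sigmoid(f) * cx + torch.sigmoid(i) * torch.tanh(g)
+    hy = torch.sigmoid(o) * torch.tanh(cy)
+    return hy, cy
+
+# toy shapes: batch 64, input 512, hidden 512
+x, hx, cx = (torch.randn(64, 512) for _ in range(3))
+w_ih, w_hh = torch.randn(4 * 512, 512), torch.randn(4 * 512, 512)
+hy, cy = layernorm_lstm_step(x, hx, cx, w_ih, w_hh, torch.nn.LayerNorm(4 * 512))
+```
+
+Run eagerly on the GPU, each of these operations would launch at least one kernel, which is exactly the overhead the TorchScript fuser addresses.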
@@ -21,7 +23,7 @@ We are constantly improving our infrastructure on trying to make the performance
1. If the customized operations are all element-wise, that's great because you can get the benefits of the PyTorch JIT's operator fusion automatically!
-2. If you have more complex operations (e.g. reduce ops mixed with element-wise ops), consider grouping the reduce operations and element-wise ops separately in order to fuse the element-wise operations into a single fusion group.
+2. If you have more complex operations (e.g. reduce ops mixed with element-wise ops), consider grouping the reduce operations and element-wise ops separately in order to fuse the element-wise operations into a single fusion group.
3. If you want to know about what has been fused in your custom RNN, you can inspect the operation's optimized graph by using `graph_for` . Using `LSTMCell` as an example:
@@ -87,7 +89,7 @@ We are constantly improving our infrastructure on trying to make the performance
return (%hy, %4, %cy, %outgate.1, %cellgate.1, %forgetgate.1, %ingate.1)
```
-From the above graph we can see that it has a `prim::FusionGroup_0` subgraph that is fusing all element-wise operations in LSTMCell (transpose and matrix multiplication are not element-wise ops). Some graph nodes might be hard to understand in the first place but we will explain some of them in the optimization section, we also omitted some long verbose operators in this post that is there just for correctness.
+From the above graph we can see that it has a `prim::FusionGroup_0` subgraph that fuses all element-wise operations in LSTMCell (transpose and matrix multiplication are not element-wise ops). Some graph nodes might be hard to understand at first, but we will explain some of them in the optimization section; we have also omitted some long, verbose operators that are there only for correctness.
## Variable-length sequences best practices
@@ -108,7 +110,7 @@ Of course, `output` may have some garbage data in the padded regions; use `lengt
## Optimizations
-We will now explain the optimizations performed by the PyTorch JIT to speed up custom RNNs. We will use a simple custom LSTM model in TorchScript to illustrate the optimizations, but many of these are general and apply to other RNNs.
+We will now explain the optimizations performed by the PyTorch JIT to speed up custom RNNs. We will use a simple custom LSTM model in TorchScript to illustrate the optimizations, but many of these are general and apply to other RNNs.
To illustrate the optimizations we did and how we get benefits from those optimizations, we will run a simple custom LSTM model written in TorchScript (you can refer the code in the custom_lstm.py or the below code snippets) and time our changes.
@@ -119,10 +121,10 @@ input_size = 512
hidden_size = 512
mini_batch = 64
numLayers = 1
-seq_length = 100
+seq_length = 100
```
-The most important thing PyTorch JIT did is to compile the python program to a PyTorch JIT IR, which is an intermediate representation used to model the program's graph structure. This IR can then benefit from whole program optimization, hardware acceleration and overall has the potential to provide large computation gains. In this example, we run the initial TorchScript model with only compiler optimization passes that are provided by the JIT, including common subexpression elimination, constant pooling, constant propagation, dead code elimination and some peephole optimizations. We run the model training for 100 times after warm up and average the training time. The initial results for model forward time is around 27ms and backward time is around 64ms, which is a bit far away from what PyTorch cuDNN LSTM provided. Next we will explain the major optimizations we did on how we improve the performance on training or inferencing, starting with LSTMCell and LSTMLayer, and some misc optimizations.
+The most important thing the PyTorch JIT does is compile the Python program to a PyTorch JIT IR, an intermediate representation used to model the program's graph structure. This IR can then benefit from whole-program optimization and hardware acceleration, and overall has the potential to provide large computation gains. In this example, we run the initial TorchScript model with only the compiler optimization passes provided by the JIT, including common subexpression elimination, constant pooling, constant propagation, dead code elimination and some peephole optimizations. We run model training 100 times after warm-up and average the training time. The initial results are a model forward time of around 27ms and a backward time of around 64ms, which is still some distance from what the PyTorch cuDNN LSTM provides. Next we will explain the major optimizations we made to improve training and inference performance, starting with LSTMCell and LSTMLayer, plus some miscellaneous optimizations.
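+As a rough sketch of the measurement procedure itself (warm up, then average over 100 runs), something like the following could be used; the toy `cell_step` function and the CUDA device are assumptions for illustration, not the post's benchmark script:
+
+```python
+import time
+import torch
+
+@torch.jit.script
+def cell_step(x, h, w):
+    # a toy matmul + element-wise step, just to illustrate the timing procedure
+    return torch.tanh(x @ w + h)
+
+x = torch.randn(64, 512, device="cuda")
+h = torch.randn(64, 512, device="cuda")
+w = torch.randn(512, 512, device="cuda")
+
+for _ in range(10):        # warm up so the JIT can apply its optimization passes
+    cell_step(x, h, w)
+torch.cuda.synchronize()
+
+start = time.time()
+for _ in range(100):
+    cell_step(x, h, w)
+torch.cuda.synchronize()
+print((time.time() - start) / 100 * 1000, "ms per forward")
+```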
### LSTM Cell (forward)
@@ -159,11 +161,11 @@ class LSTMCell(jit.ScriptModule):
```
-This graph representation (IR) that TorchScript generated enables several optimizations and scalable computations. In addition to the typical compiler optimizations that we could do (CSE, constant propagation, etc. ) we can also run other IR transformations to make our code run faster.
+This graph representation (IR) that TorchScript generated enables several optimizations and scalable computations. In addition to the typical compiler optimizations that we could do (CSE, constant propagation, etc.), we can also run other IR transformations to make our code run faster.
* Element-wise operator fusion. PyTorch JIT will automatically fuse element-wise ops, so when you have adjacent operators that are all element-wise, JIT will automatically group all those operations together into a single FusionGroup, this FusionGroup can then be launched with a single GPU/CPU kernel and performed in one pass. This avoids expensive memory reads and writes for each operation.
* Reordering chunks and pointwise ops to enable more fusion. An LSTM cell adds gates together (a pointwise operation), and then chunks the gates into four pieces: the ifco gates. Then, it performs pointwise operations on the ifco gates like above. This leads to two fusion groups in practice: one fusion group for the element-wise ops pre-chunk, and one group for the element-wise ops post-chunk.
- The interesting thing to note here is that pointwise operations commute with `torch.chunk`: Instead of performing pointwise ops on some input tensors and chunking the output, we can chunk the input tensors and then perform the same pointwise ops on the output tensors. By moving the chunk to before the first fusion group, we can merge the first and second fusion groups into one big group.
+ The interesting thing to note here is that pointwise operations commute with `torch.chunk`: instead of performing pointwise ops on some input tensors and chunking the output, we can chunk the input tensors and then perform the same pointwise ops on the output tensors. By moving the chunk before the first fusion group, we can merge the first and second fusion groups into one big group. A small eager-mode sketch of this commuting property follows the figure below.

@@ -171,7 +173,7 @@ This graph representation (IR) that TorchScript generated enables several optimi
* Tensor creation on the CPU is expensive, but there is ongoing work to make it faster. At this point, a LSTMCell runs three CUDA kernels: two `gemm` kernels and one for the single pointwise group. One of the things we noticed was that there was a large gap between the finish of the second `gemm` and the start of the single pointwise group. This gap was a period of time when the GPU was idling around and not doing anything. Looking into it more, we discovered that the problem was that `torch.chunk` constructs new tensors and that tensor construction was not as fast as it could be. Instead of constructing new Tensor objects, we taught the fusion compiler how to manipulate a data pointer and strides to do the `torch.chunk` before sending it into the fused kernel, shrinking the amount of idle time between the second gemm and the launch of the element-wise fusion group. This give us around 1.2x increase speed up on the LSTM forward pass.
-By doing the above tricks, we are able to fuse the almost all `LSTMCell` forward graph (except the two gemm kernels) into a single fusion group, which corresponds to the `prim::FusionGroup_0` in the above IR graph. It will then be launched into a single fused kernel for execution. With these optimizations the model performance improves significantly with average forward time reduced by around 17ms (1.7x speedup) to 10ms, and average backward time reduce by 37ms to 27ms (1.37x speed up).
+By doing the above tricks, we are able to fuse almost all of the `LSTMCell` forward graph (except the two gemm kernels) into a single fusion group, which corresponds to the `prim::FusionGroup_0` in the above IR graph. It is then launched as a single fused kernel for execution. With these optimizations the model performance improves significantly, with average forward time reduced by around 17ms (1.7x speedup) to 10ms, and average backward time reduced by 37ms to 27ms (1.37x speedup).
### LSTM Layer (forward)
@@ -195,31 +197,31 @@ class LSTMLayer(jit.ScriptModule):
We did several tricks on the IR we generated for TorchScript LSTM to boost the performance, some example optimizations we did:
* Loop Unrolling: We automatically unroll loops in the code (for big loops, we unroll a small subset of it), which then empowers us to do further optimizations on the for loops control flow. For example, the fuser can fuse together operations across iterations of the loop body, which results in a good performance improvement for control flow intensive models like LSTMs.
-* Batch Matrix Multiplication: For RNNs where the input is pre-multiplied (i.e. the model has a lot of matrix multiplies with the same LHS or RHS), we can efficiently batch those operations together into a single matrix multiply while chunking the outputs to achieve equivalent semantics.
+* Batch Matrix Multiplication: For RNNs where the input is pre-multiplied (i.e. the model has a lot of matrix multiplies with the same LHS or RHS), we can efficiently batch those operations together into a single matrix multiply while chunking the outputs to achieve equivalent semantics, as sketched below.
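+For instance, an input sequence multiplied by the same weight at every time step can be handled with one large matmul; the shapes below are illustrative only:
+
+```python
+import torch
+
+w_ih = torch.randn(512, 2048)                     # shared RHS (e.g. input-to-hidden weights)
+xs = [torch.randn(64, 512) for _ in range(100)]   # one input per time step
+# naive: one matmul per time step
+naive = [x @ w_ih for x in xs]
+# batched: a single large matmul, then chunk the result back per time step
+batched = (torch.cat(xs, dim=0) @ w_ih).chunk(len(xs), dim=0)
+assert all(torch.allclose(a, b, atol=1e-4) for a, b in zip(naive, batched))
+```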
-By applying these techniques, we reduced our time in the forward pass by an additional 1.6ms to 8.4ms (1.2x speed up) and timing in backward by 7ms to around 20ms (1.35x speed up).
+By applying these techniques, we reduced our forward-pass time by an additional 1.6ms to 8.4ms (1.2x speedup) and the backward time by 7ms to around 20ms (1.35x speedup).
### LSTM Layer (backward)
* “Tree” Batch Matrix Multiplication: It is often the case that a single weight is reused multiple times in the LSTM backward graph, forming a tree where the leaves are matrix multiplies and the nodes are adds. These nodes can be combined together by concatenating the LHSs and RHSs in different dimensions, then computed as a single matrix multiplication. The formula of equivalence can be denoted as follows (a runnable check appears after this list):
-
+
$L1 * R1 + L2 * R2 = torch.cat((L1, L2), dim=1) * torch.cat((R1, R2), dim=0)$
-
-* Autograd is a critical component of what makes PyTorch such an elegant ML framework. As such, we carried this through to PyTorch JIT, but using a new **Automatic Differentiation** (AD) mechanism that works on the IR level. JIT automatic differentiation will slice the forward graph into symbolically differentiable subgraphs, and generate backwards nodes for those subgraphs. Taking the above IR as an example, we group the graph nodes into a single `prim::DifferentiableGraph_0` for the operations that has AD formulas. For operations that have not been added to AD formulas, we will fall back to Autograd during execution.
+
+* Autograd is a critical component of what makes PyTorch such an elegant ML framework. As such, we carried this through to PyTorch JIT, but using a new **Automatic Differentiation** (AD) mechanism that works on the IR level. JIT automatic differentiation will slice the forward graph into symbolically differentiable subgraphs and generate backward nodes for those subgraphs. Taking the above IR as an example, we group the graph nodes into a single `prim::DifferentiableGraph_0` for the operations that have AD formulas. For operations that have not been added to AD formulas, we fall back to Autograd during execution.
* Optimizing the backwards path is hard, and the implicit broadcasting semantics make the optimization of automatic differentiation harder. PyTorch makes it convenient to write tensor operations without worrying about the shapes by broadcasting the tensors for you. For performance, the pain point in the backward pass is that we need a summation for such broadcastable operations. This results in the derivative of every broadcastable op being followed by a summation. Since we cannot currently fuse reduce operations, this causes FusionGroups to break into multiple small groups, leading to bad performance. To deal with this, refer to this great [post](http://lernapparat.de/fast-lstm-pytorch/) written by Thomas Viehmann.
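+A quick numerical check of the “tree” batch matrix multiplication equivalence above (small, arbitrary shapes):
+
+```python
+import torch
+
+L1, L2 = torch.randn(8, 16), torch.randn(8, 16)
+R1, R2 = torch.randn(16, 4), torch.randn(16, 4)
+separate = L1 @ R1 + L2 @ R2
+combined = torch.cat((L1, L2), dim=1) @ torch.cat((R1, R2), dim=0)
+assert torch.allclose(separate, combined, atol=1e-5)
+```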
### Misc Optimizations
* In addition to the steps laid out above, we also eliminated overhead between CUDA kernel launches and unnecessary tensor allocations. One example is a tensor device lookup, which initially performed poorly and triggered a lot of unnecessary allocations. Removing these reduced the gap between kernel launches from milliseconds to nanoseconds.
-* Lastly, there might be normalization applied in the custom LSTMCell like LayerNorm. Since LayerNorm and other normalization ops contains reduce operations, it is hard to fuse it in its entirety. Instead, we automatically decompose Layernorm to a statistics computation (reduce operations) + element-wise transformations, and then fuse those element-wise parts together. As of this post, there are some limitations on our auto differentiation and graph fuser infrastructure which limits the current support to inference mode only. We plan to add backward support in a future release.
+* Lastly, there might be normalization applied in the custom LSTMCell, such as LayerNorm. Since LayerNorm and other normalization ops contain reduce operations, it is hard to fuse them in their entirety. Instead, we automatically decompose LayerNorm into a statistics computation (reduce operations) plus element-wise transformations, and then fuse those element-wise parts together. As of this post, there are some limitations on our auto differentiation and graph fuser infrastructure that limit the current support to inference mode only. We plan to add backward support in a future release. A conceptual sketch of the decomposition follows this list.
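+As an illustration of that decomposition (a conceptual sketch of the idea, not the fuser's actual rewrite):
+
+```python
+import torch
+
+def decomposed_layer_norm(x, weight, bias, eps=1e-5):
+    # statistics computation: reduce ops over the normalized dimension ...
+    mean = x.mean(dim=-1, keepdim=True)
+    var = x.var(dim=-1, unbiased=False, keepdim=True)
+    # ... followed by purely element-wise transformations, which the fuser can handle
+    return (x - mean) / torch.sqrt(var + eps) * weight + bias
+
+x = torch.randn(64, 2048)
+weight, bias = torch.ones(2048), torch.zeros(2048)
+reference = torch.nn.functional.layer_norm(x, (2048,), weight, bias)
+assert torch.allclose(decomposed_layer_norm(x, weight, bias), reference, atol=1e-4)
+```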
-With the above optimizations on operation fusion, loop unrolling, batch matrix multiplication and some misc optimizations, we can see a clear performance increase on our custom TorchScript LSTM forward and backward from the following figure:
+With the above optimizations on operation fusion, loop unrolling, batch matrix multiplication and some misc optimizations, we can see a clear performance increase on our custom TorchScript LSTM forward and backward from the following figure:
-There are a number of additional optimizations that we did not cover in this post. In addition to the ones laid out in this post, we now see that our custom LSTM forward pass is on par with cuDNN. We are also working on optimizing backward more and expect to see improvements in future releases. Besides the speed that TorchScript provides, we introduced a much more flexible API that enable you to hand draft a lot more custom RNNs, which cuDNN could not provide.
+There are a number of additional optimizations that we did not cover in this post. In addition to the ones laid out here, we now see that our custom LSTM forward pass is on par with cuDNN. We are also working on optimizing the backward pass further and expect to see improvements in future releases. Besides the speed that TorchScript provides, we introduced a much more flexible API that enables you to hand-craft many more custom RNNs than cuDNN can offer.
diff --git a/_posts/2019-5-22-torchvision03.md b/_posts/2019-5-22-torchvision03.md
index eb807b4394b3..1a4bd9330ac0 100644
--- a/_posts/2019-5-22-torchvision03.md
+++ b/_posts/2019-5-22-torchvision03.md
@@ -3,6 +3,9 @@ layout: blog_detail
title: 'torchvision 0.3: segmentation, detection models, new datasets and more..'
author: Francisco Massa
redirect_from: /2019/05/23/torchvision03.html
+image: /assets/images/bert2.png
+tags: [three]
+preview: 'PyTorch domain libraries like torchvision provide convenient access to common datasets and models that can be used to quickly create a state-of-the-art baseline. Moreover, they also provide common abstractions to reduce boilerplate code that users might have to otherwise repeatedly write. The torchvision 0.3 release brings several new features including models for semantic segmentation, object detection, instance segmentation, and person keypoint detection, as well as custom C++ / CUDA ops specific to computer vision.'
---
PyTorch domain libraries like torchvision provide convenient access to common datasets and models that can be used to quickly create a state-of-the-art baseline. Moreover, they also provide common abstractions to reduce boilerplate code that users might have to otherwise repeatedly write. The torchvision 0.3 release brings several new features including models for semantic segmentation, object detection, instance segmentation, and person keypoint detection, as well as custom C++ / CUDA ops specific to computer vision.
diff --git a/_posts/2020-07-28-accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision.md b/_posts/2020-07-28-accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision.md
index 3a45e36b35fa..a79ba620b99f 100644
--- a/_posts/2020-07-28-accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision.md
+++ b/_posts/2020-07-28-accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision.md
@@ -2,14 +2,17 @@
layout: blog_detail
title: 'Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs'
author: Mengdi Huang, Chetan Tekur, Michael Carilli
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However, this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for mixed-precision training, which combined single-precision (FP32) with half-precision (e.g. FP16) format when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits on NVIDIA GPUs.'
---
-Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for [mixed-precision training](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/), which combined [single-precision](https://blogs.nvidia.com/blog/2019/11/15/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/) (FP32) with half-precision (e.g. FP16) format when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits on NVIDIA GPUs:
+Most deep learning frameworks, including PyTorch, train with 32-bit floating point (FP32) arithmetic by default. However, this is not essential to achieve full accuracy for many deep learning models. In 2017, NVIDIA researchers developed a methodology for [mixed-precision training](https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/), which combined [single-precision](https://blogs.nvidia.com/blog/2019/11/15/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/) (FP32) with half-precision (e.g. FP16) format when training a network, and achieved the same accuracy as FP32 training using the same hyperparameters, with additional performance benefits on NVIDIA GPUs:
* Shorter training time;
* Lower memory requirements, enabling larger batch sizes, larger models, or larger inputs.
-In order to streamline the user experience of training in mixed precision for researchers and practitioners, NVIDIA developed [Apex](https://developer.nvidia.com/blog/apex-pytorch-easy-mixed-precision-training/) in 2018, which is a lightweight PyTorch extension with [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision) (AMP) feature. This feature enables automatic conversion of certain GPU operations from FP32 precision to mixed precision, thus improving performance while maintaining accuracy.
+In order to streamline the user experience of training in mixed precision for researchers and practitioners, NVIDIA developed [Apex](https://developer.nvidia.com/blog/apex-pytorch-easy-mixed-precision-training/) in 2018, which is a lightweight PyTorch extension with [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision) (AMP) feature. This feature enables automatic conversion of certain GPU operations from FP32 precision to mixed precision, thus improving performance while maintaining accuracy.
For the PyTorch 1.6 release, developers at NVIDIA and Facebook moved mixed precision functionality into PyTorch core as the AMP package, [torch.cuda.amp](https://pytorch.org/docs/stable/amp.html). `torch.cuda.amp` is more flexible and intuitive compared to `apex.amp`. Some of `apex.amp`'s known pain points that `torch.cuda.amp` has been able to fix:
@@ -22,7 +25,7 @@ For the PyTorch 1.6 release, developers at NVIDIA and Facebook moved mixed preci
* torch.cuda.amp.autocast() has no effect outside regions where it's enabled, so it should serve cases that formerly struggled with multiple calls to [apex.amp.initialize()](https://github.com/NVIDIA/apex/issues/439) (including [cross-validation)](https://github.com/NVIDIA/apex/issues/392#issuecomment-610038073) without difficulty. Multiple convergence runs in the same script should each use a fresh [GradScaler instance](https://github.com/NVIDIA/apex/issues/439#issuecomment-610028282), but GradScalers are lightweight and self-contained so that's not a problem.
* Sparse gradient support
-With AMP being added to PyTorch core, we have started the process of deprecating `apex.amp.` We have moved `apex.amp` to maintenance mode and will support customers using `apex.amp.` However, we highly encourage `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core.
+With AMP being added to PyTorch core, we have started the process of deprecating `apex.amp`. We have moved `apex.amp` to maintenance mode and will continue to support customers using it. However, we highly encourage `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core.
# Example Walkthrough
Please see official docs for usage:
@@ -32,33 +35,33 @@ Please see official docs for usage:
Example:
```python
-import torch
-# Creates once at the beginning of training
-scaler = torch.cuda.amp.GradScaler()
-
-for data, label in data_iter:
- optimizer.zero_grad()
- # Casts operations to mixed precision
- with torch.cuda.amp.autocast():
- loss = model(data)
-
- # Scales the loss, and calls backward()
- # to create scaled gradients
- scaler.scale(loss).backward()
-
- # Unscales gradients and calls
- # or skips optimizer.step()
- scaler.step(optimizer)
-
- # Updates the scale for next iteration
- scaler.update()
+import torch
+# Creates once at the beginning of training
+scaler = torch.cuda.amp.GradScaler()
+
+for data, label in data_iter:
+ optimizer.zero_grad()
+ # Casts operations to mixed precision
+ with torch.cuda.amp.autocast():
+ loss = model(data)
+
+ # Scales the loss, and calls backward()
+ # to create scaled gradients
+ scaler.scale(loss).backward()
+
+ # Unscales gradients and calls
+ # or skips optimizer.step()
+ scaler.step(optimizer)
+
+ # Updates the scale for next iteration
+ scaler.update()
```
# Performance Benchmarks
In this section, we discuss the accuracy and performance of mixed precision training with AMP on the latest NVIDIA GPU A100 and also previous generation V100 GPU. The mixed precision performance is compared to FP32 performance, when running Deep Learning workloads in the [NVIDIA pytorch:20.06-py3 container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch?ncid=partn-52193#cid=ngc01_partn_en-us) from NGC.
## Accuracy: AMP (FP16), FP32
-The advantage of using AMP for Deep Learning training is that the models converge to the similar final accuracy while providing improved training performance. To illustrate this point, for [Resnet 50 v1.5 training](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5#training-accuracy-nvidia-dgx-a100-8x-a100-40gb), we see the following accuracy results where higher is better. Please note that the below accuracy numbers are sample numbers that are subject to run to run variance of up to 0.4%. Accuracy numbers for other models including BERT, Transformer, ResNeXt-101, Mask-RCNN, DLRM can be found at [NVIDIA Deep Learning Examples Github](https://github.com/NVIDIA/DeepLearningExamples).
+The advantage of using AMP for Deep Learning training is that the models converge to similar final accuracy while providing improved training performance. To illustrate this point, for [Resnet 50 v1.5 training](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5#training-accuracy-nvidia-dgx-a100-8x-a100-40gb), we see the following accuracy results, where higher is better. Please note that the accuracy numbers below are sample numbers that are subject to run-to-run variance of up to 0.4%. Accuracy numbers for other models, including BERT, Transformer, ResNeXt-101, Mask-RCNN and DLRM, can be found at the [NVIDIA Deep Learning Examples Github](https://github.com/NVIDIA/DeepLearningExamples).
Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
@@ -78,7 +81,7 @@ Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-
+
@@ -104,7 +107,7 @@ Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-## Speedup Performance:
+## Speedup Performance:
### FP16 on NVIDIA V100 vs. FP32 on V100
AMP with FP16 is the most performant option for DL training on the V100. In Table 1, we can observe that for various models, AMP on V100 provides a speedup of 1.5x to 5.5x over FP32 on V100 while converging to the same final accuracy.
@@ -124,8 +127,8 @@ AMP with FP16 remains the most performant option for DL training on the A100. In
*Figure 3. Performance of mixed precision training on NVIDIA 8xA100 vs. 8xV100 GPU. Bars represent the speedup factor of A100 over V100. The higher the better.*
# Call to action
-AMP provides a healthy speedup for Deep Learning training workloads on Nvidia Tensor Core GPUs, especially on the latest Ampere generation A100 GPUs. You can start experimenting with AMP enabled models and model scripts for A100, V100, T4 and other GPUs available at NVIDIA deep learning [examples](https://github.com/NVIDIA/DeepLearningExamples). NVIDIA PyTorch with native AMP support is available from the [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch?ncid=partn-52193#cid=ngc01_partn_en-us) version 20.06. We highly encourage existing `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core available in the latest [PyTorch 1.6 release](https://pytorch.org/blog/pytorch-1.6-released/).
+AMP provides a healthy speedup for Deep Learning training workloads on Nvidia Tensor Core GPUs, especially on the latest Ampere generation A100 GPUs. You can start experimenting with AMP enabled models and model scripts for A100, V100, T4 and other GPUs available at NVIDIA deep learning [examples](https://github.com/NVIDIA/DeepLearningExamples). NVIDIA PyTorch with native AMP support is available from the [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch?ncid=partn-52193#cid=ngc01_partn_en-us) version 20.06. We highly encourage existing `apex.amp` customers to transition to using `torch.cuda.amp` from PyTorch Core available in the latest [PyTorch 1.6 release](https://pytorch.org/blog/pytorch-1.6-released/).
+
+
+
-
-
-
diff --git a/_posts/2020-07-28-microsoft-becomes-maintainer-of-the-windows-version-of-pytorch.md b/_posts/2020-07-28-microsoft-becomes-maintainer-of-the-windows-version-of-pytorch.md
index 8f74a50570bd..9f67193bdf09 100644
--- a/_posts/2020-07-28-microsoft-becomes-maintainer-of-the-windows-version-of-pytorch.md
+++ b/_posts/2020-07-28-microsoft-becomes-maintainer-of-the-windows-version-of-pytorch.md
@@ -2,14 +2,16 @@
layout: blog_detail
title: 'Microsoft becomes maintainer of the Windows version of PyTorch'
author: Maxim Lukiyanov - Principal PM at Microsoft, Emad Barsoum - Group EM at Microsoft, Guoliang Hua - Principal EM at Microsoft, Nikita Shulga - Tech Lead at Facebook, Geeta Chauhan - PE Lead at Facebook, Chris Gottbrath - Technical PM at Facebook, Jiachen Pu - Engineer at Facebook
-
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'Along with the PyTorch 1.6 release, we are excited to announce that Microsoft has expanded its participation in the PyTorch community and is taking ownership of the development and maintenance of the PyTorch build for Windows.'
---
Along with the PyTorch 1.6 release, we are excited to announce that Microsoft has expanded its participation in the PyTorch community and is taking ownership of the development and maintenance of the PyTorch build for Windows.
According to the latest [Stack Overflow developer survey](https://insights.stackoverflow.com/survey/2020#technology-developers-primary-operating-systems), Windows remains the primary operating system for the developer community (46% Windows vs 28% MacOS). [Jiachen Pu](https://github.com/peterjc123) initially made a heroic effort to add support for PyTorch on Windows, but due to limited resources, Windows support for PyTorch has lagged behind other platforms. Lack of test coverage resulted in unexpected issues popping up every now and then. Some of the core tutorials, meant for new users to learn and adopt PyTorch, would fail to run. The installation experience was also not as smooth, with the lack of official PyPI support for PyTorch on Windows. Lastly, some of the PyTorch functionality was simply not available on the Windows platform, such as the TorchAudio domain library and distributed training support. To help alleviate this pain, Microsoft is happy to bring its Windows expertise to the table and bring PyTorch on Windows to its best possible self.
-In the PyTorch 1.6 release, we have improved the core quality of the Windows build by bringing test coverage up to par with Linux for core PyTorch and its domain libraries and by automating tutorial testing. Thanks to the broader PyTorch community, which contributed TorchAudio support to Windows, we were able to add test coverage to all three domain libraries: TorchVision, TorchText and TorchAudio. In subsequent releases of PyTorch, we will continue improving the Windows experience based on community feedback and requests. So far, the feedback we received from the community points to distributed training support and a better installation experience using pip as the next areas of improvement.
+In the PyTorch 1.6 release, we have improved the core quality of the Windows build by bringing test coverage up to par with Linux for core PyTorch and its domain libraries and by automating tutorial testing. Thanks to the broader PyTorch community, which contributed TorchAudio support to Windows, we were able to add test coverage to all three domain libraries: TorchVision, TorchText and TorchAudio. In subsequent releases of PyTorch, we will continue improving the Windows experience based on community feedback and requests. So far, the feedback we received from the community points to distributed training support and a better installation experience using pip as the next areas of improvement.
In addition to the native Windows experience, Microsoft released a preview adding [GPU compute support to Windows Subsystem for Linux (WSL) 2](https://blogs.windows.com/windowsdeveloper/2020/06/17/gpu-accelerated-ml-training-inside-the-windows-subsystem-for-linux/) distros, with a focus on enabling AI and ML developer workflows. WSL is designed for developers that want to run any Linux based tools directly on Windows. This preview enables valuable scenarios for a variety of frameworks and Python packages that utilize [NVIDIA CUDA](https://developer.nvidia.com/cuda/wsl) for acceleration and only support Linux. This means WSL customers using the preview can run native Linux based PyTorch applications on Windows unmodified without the need for a traditional virtual machine or a dual boot setup.
diff --git a/_posts/2020-07-28-pytorch-feature-classification-changes.md b/_posts/2020-07-28-pytorch-feature-classification-changes.md
index 1ace6ef10388..13c850b83453 100644
--- a/_posts/2020-07-28-pytorch-feature-classification-changes.md
+++ b/_posts/2020-07-28-pytorch-feature-classification-changes.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'PyTorch feature classification changes'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [six]
+preview: 'Traditionally, features in PyTorch were classified as either stable or experimental, with an implicit third option of testing bleeding-edge features by building master or installing nightly builds (available via prebuilt whls). This has, in a few cases, caused some confusion around the level of readiness, commitment to the feature and backward compatibility that can be expected from a user perspective. Moving forward, we’d like to better classify the 3 types of features as well as define explicitly here what each means from a user perspective.'
---
Traditionally, features in PyTorch were classified as either stable or experimental, with an implicit third option of testing bleeding-edge features by building master or installing nightly builds (available via prebuilt whls). This has, in a few cases, caused some confusion around the level of readiness, commitment to the feature and backward compatibility that can be expected from a user perspective. Moving forward, we’d like to better classify the 3 types of features as well as define explicitly here what each means from a user perspective.
diff --git a/_posts/2020-08-11-efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus.md b/_posts/2020-08-11-efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus.md
index 0f5f5873e26c..356940a81d7d 100644
--- a/_posts/2020-08-11-efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus.md
+++ b/_posts/2020-08-11-efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs'
author: Alex Aizman, Gavin Maltby, Thomas Breuel
+image: /assets/images/bert2.png
+tags: [six]
+preview: 'Data sets are growing bigger every day and GPUs are getting faster. This means there are more data sets for deep learning researchers and engineers to train and validate their models.'
---
Data sets are growing bigger every day and GPUs are getting faster. This means there are more data sets for deep learning researchers and engineers to train and validate their models.
@@ -20,7 +23,7 @@ However, working with the large amount of data sets presents a number of challen
* **Shuffling and Augmentation:** training data needs to be shuffled and augmented prior to training.
* **Scalability:** users often want to develop and test on small datasets and then rapidly scale up to large datasets.
-Traditional local and network file systems, and even object storage servers, are not designed for these kinds of applications. [The WebDataset I/O library](https://github.com/tmbdev/webdataset) for PyTorch, together with the optional [AIStore server](https://github.com/NVIDIA/aistore) and [Tensorcom](https://github.com/NVlabs/tensorcom) RDMA libraries, provide an efficient, simple, and standards-based solution to all these problems. The library is simple enough for day-to-day use, is based on mature open source standards, and is easy to migrate to from existing file-based datasets.
+Traditional local and network file systems, and even object storage servers, are not designed for these kinds of applications. [The WebDataset I/O library](https://github.com/tmbdev/webdataset) for PyTorch, together with the optional [AIStore server](https://github.com/NVIDIA/aistore) and [Tensorcom](https://github.com/NVlabs/tensorcom) RDMA libraries, provides an efficient, simple, and standards-based solution to all these problems. The library is simple enough for day-to-day use, is based on mature open source standards, and is easy to migrate to from existing file-based datasets.
Using WebDataset is simple and requires little effort, and it will let you scale up the same code from running local experiments to using hundreds of GPUs on clusters or in the cloud with linearly scalable performance. Even on small problems and on your desktop, it can speed up I/O tenfold and simplifies data management and processing of large datasets. The rest of this blog post tells you how to get started with WebDataset and how it works.
@@ -38,7 +41,7 @@ The WebDataset library is a complete solution for working with large datasets an
## Benefits
-The use of sharded, sequentially readable formats is essential for very large datasets. In addition, it has benefits in many other environments. WebDataset provides a solution that scales well from small problems on a desktop machine to very large deep learning problems in clusters or in the cloud. The following table summarizes some of the benefits in different environments.
+The use of sharded, sequentially readable formats is essential for very large datasets. In addition, it has benefits in many other environments. WebDataset provides a solution that scales well from small problems on a desktop machine to very large deep learning problems in clusters or in the cloud. The following table summarizes some of the benefits in different environments.
{:.table.table-striped.table-bordered}
| Environment | Benefits of WebDataset |
@@ -122,7 +125,7 @@ for inputs, targets in loader:
```
This code is nearly identical to the file-based I/O pipeline found in the PyTorch Imagenet example: it creates a preprocessing/augmentation pipeline, instantiates a dataset using that pipeline and a data source location, and then constructs a DataLoader instance from the dataset.
-
+
WebDataset uses a fluent API for configuration that internally builds up a processing pipeline. In this example, without any added processing stages, WebDataset is used with the PyTorch DataLoader class, which replicates the dataset instance across multiple workers and performs both parallel I/O and parallel data augmentation.
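+As a rough sketch of that combination (this assumes the `webdataset` package's fluent interface described above; the exact method names, decoder arguments, and shard URL are illustrative and vary across library versions):
+
+```python
+import torch
+import webdataset as wds
+
+url = "http://storage.example.com/imagenet-train-{000000..000146}.tar"  # hypothetical shard URLs
+dataset = (
+    wds.WebDataset(url)          # read samples sequentially from tar shards
+    .shuffle(1000)               # shuffle within a rolling buffer
+    .decode("pil")               # decode images
+    .to_tuple("jpg;png", "cls")  # pick out (image, label) pairs
+)
+# in a real pipeline a resize/ToTensor transform would be mapped in before batching
+loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)
+```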
WebDataset instances themselves just iterate through each training sample as a dictionary:
diff --git a/_posts/2020-08-18-pytorch-1.6-now-includes-stochastic-weight-averaging.md b/_posts/2020-08-18-pytorch-1.6-now-includes-stochastic-weight-averaging.md
index 106761d7cf43..047b208eada7 100644
--- a/_posts/2020-08-18-pytorch-1.6-now-includes-stochastic-weight-averaging.md
+++ b/_posts/2020-08-18-pytorch-1.6-now-includes-stochastic-weight-averaging.md
@@ -2,9 +2,14 @@
layout: blog_detail
title: 'PyTorch 1.6 now includes Stochastic Weight Averaging'
author: Pavel Izmailov, Andrew Gordon Wilson and Vincent Queneneville-Belair
+image: /assets/images/bert2.png
+tags: [six]
+preview: 'Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedure you use to train your neural network, you can likely achieve significantly better generalization at virtually no additional cost with a simple new technique now natively supported in PyTorch 1.6, Stochastic Weight Averaging (SWA) [1]. Even if you have already trained your model, it’s easy to realize the benefits of SWA by running SWA for a small number of epochs starting with a pre-trained model.'
---
-Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedure you use to train your neural network, you can likely achieve significantly better generalization at virtually no additional cost with a simple new technique now natively supported in PyTorch 1.6, Stochastic Weight Averaging (SWA) [1]. Even if you have already trained your model, it’s easy to realize the benefits of SWA by running SWA for a small number of epochs starting with a pre-trained model. [Again](https://twitter.com/MilesCranmer/status/1282140440892932096) and [again](https://twitter.com/leopd/status/1285969855062192129), researchers are discovering that SWA improves the performance of well-tuned models in a wide array of practical applications with little cost or effort!
+Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedure you use to train your neural network, you can likely achieve significantly better generalization at virtually no additional cost with a simple new technique now natively supported in PyTorch 1.6, Stochastic Weight Averaging (SWA) [1]. Even if you have already trained your model, it’s easy to realize the benefits of SWA by running SWA for a small number of epochs starting with a pre-trained model.
+
+[Again](https://twitter.com/MilesCranmer/status/1282140440892932096) and [again](https://twitter.com/leopd/status/1285969855062192129), researchers are discovering that SWA improves the performance of well-tuned models in a wide array of practical applications with little cost or effort!
SWA has a wide range of applications and features:
@@ -30,7 +35,7 @@ Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new conveni
At a high level, averaging SGD iterates dates back several decades in convex optimization [7, 8], where it is sometimes referred to as Polyak-Ruppert averaging, or averaged SGD. **But the details matter**. Averaged SGD is often used in conjunction with a decaying learning rate, and an exponential moving average (EMA), typically for convex optimization. In convex optimization, the focus has been on improved rates of convergence. In deep learning, this form of averaged SGD smooths the trajectory of SGD iterates but does not perform very differently.
-By contrast, SWA uses an **equal average** of SGD iterates with a modified **cyclical or high constant learning rate** and exploits the flatness of training objectives [8] specific to **deep learning** for **improved generalization**.
+By contrast, SWA uses an **equal average** of SGD iterates with a modified **cyclical or high constant learning rate** and exploits the flatness of training objectives [8] specific to **deep learning** for **improved generalization**.
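+The equal average itself is just a running mean over the collected SGD iterates; a minimal sketch (not the `torch.optim.swa_utils` implementation) could look like this:
+
+```python
+import torch
+
+@torch.no_grad()
+def update_equal_average(avg_params, new_params, n_averaged):
+    # running mean: avg <- (avg * n + new) / (n + 1)
+    for p_avg, p_new in zip(avg_params, new_params):
+        p_avg.mul_(n_averaged / (n_averaged + 1)).add_(p_new / (n_averaged + 1))
+    return n_averaged + 1
+
+# toy usage with two "snapshots" of a single parameter tensor
+avg, n = [torch.zeros(3)], 0
+n = update_equal_average(avg, [torch.ones(3)], n)          # avg[0] is now [1., 1., 1.]
+n = update_equal_average(avg, [torch.full((3,), 3.0)], n)  # avg[0] is now [2., 2., 2.]
+```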
## How does Stochastic Weight Averaging Work?
@@ -48,9 +53,9 @@ While we focus on SGD for simplicity in the description above, SWA can be combin
## How to use SWA in PyTorch?
-In `torch.optim.swa_utils` we implement all the SWA ingredients to make it convenient to use SWA with any model. In particular, we implement `AveragedModel` class for SWA models, `SWALR` learning rate scheduler, and `update_bn` utility function to update SWA batch normalization statistics at the end of training.
+In `torch.optim.swa_utils` we implement all the SWA ingredients to make it convenient to use SWA with any model. In particular, we implement the `AveragedModel` class for SWA models, the `SWALR` learning rate scheduler, and the `update_bn` utility function to update the SWA batch normalization statistics at the end of training.
-In the example below, `swa_model` is the SWA model that accumulates the averages of the weights. We train the model for a total of 300 epochs, and we switch to the SWA learning rate schedule and start to collect SWA averages of the parameters at epoch 160.
+In the example below, `swa_model` is the SWA model that accumulates the averages of the weights. We train the model for a total of 300 epochs, and we switch to the SWA learning rate schedule and start to collect SWA averages of the parameters at epoch 160.
```python
from torch.optim.swa_utils import AveragedModel, SWALR
@@ -75,7 +80,7 @@ for epoch in range(100):
# Update bn statistics for the swa_model at the end
torch.optim.swa_utils.update_bn(loader, swa_model)
-# Use swa_model to make predictions on test data
+# Use swa_model to make predictions on test data
preds = swa_model(test_input)
```
@@ -94,7 +99,7 @@ In practice, we find an equal average with the modified learning rate schedule i
`SWALR` is a learning rate scheduler that anneals the learning rate to a fixed value, and then keeps it constant. For example, the following code creates a scheduler that linearly anneals the learning rate from its initial value to `0.05` in `5` epochs within each parameter group.
```python
-swa_scheduler = torch.optim.swa_utils.SWALR(optimizer,
+swa_scheduler = torch.optim.swa_utils.SWALR(optimizer,
anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)
```
@@ -114,7 +119,7 @@ for epoch in range(100):
Finally, `update_bn` is a utility function that computes the batchnorm statistics for the SWA model on a given dataloader `loader`:
```
-torch.optim.swa_utils.update_bn(loader, swa_model)
+torch.optim.swa_utils.update_bn(loader, swa_model)
```
`update_bn` applies the `swa_model` to every element in the dataloader and computes the activation statistics for each batch normalization layer in the model.
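+Conceptually, that amounts to resetting each batchnorm layer's running statistics and doing one pass over the data in training mode. The sketch below illustrates the idea only; it is not the code of `torch.optim.swa_utils.update_bn`:
+
+```python
+import torch
+
+@torch.no_grad()
+def recompute_bn_stats(loader, model):
+    was_training = model.training
+    model.train()  # BN layers only update running stats in training mode
+    for module in model.modules():
+        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
+            module.reset_running_stats()
+    for batch in loader:
+        inputs = batch[0] if isinstance(batch, (list, tuple)) else batch
+        model(inputs)
+    model.train(was_training)
+```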
@@ -145,9 +150,9 @@ We release a GitHub [repo](https://github.com/izmailovpavel/torch_swa_examples)
{:.table.table-striped.table-bordered}
- | | VGG-16 | ResNet-164 | WideResNet-28x10 |
+ | | VGG-16 | ResNet-164 | WideResNet-28x10 |
| ------------- | ------------- | ------------- | ------------- |
-| SGD | 72.8 ± 0.3 | 78.4 ± 0.3 | 81.0 ± 0.3 |
+| SGD | 72.8 ± 0.3 | 78.4 ± 0.3 | 81.0 ± 0.3 |
| SWA | 74.4 ± 0.3 | 79.8 ± 0.4 | 82.5 ± 0.2 |
@@ -166,8 +171,8 @@ In another follow-up [paper](http://www.gatsby.ucl.ac.uk/~balaji/udl-camera-read
{:.table.table-striped.table-bordered}
- | Environment Name | A2C | A2C + SWA |
-| ------------- | ------------- | ------------- |
+ | Environment Name | A2C | A2C + SWA |
+| ------------- | ------------- | ------------- |
| Breakout | 522 ± 34 | 703 ± 60 |
| Qbert | 18777 ± 778 | 21272 ± 655 |
| SpaceInvaders | 7727 ± 1121 | 21676 ± 8897 |
@@ -183,7 +188,7 @@ We can filter through quantization noise by combining weights that have been rou
-**Figure 9**. *Quantizing a solution leads to a perturbation of the weights which has a greater effect on the quality of the sharp solution (left) compared to wide solution (right)*.
+**Figure 9**. *Quantizing a solution leads to a perturbation of the weights which has a greater effect on the quality of the sharp solution (left) compared to wide solution (right)*.
@@ -204,7 +209,7 @@ SWA can be viewed as taking the first moment of SGD iterates with a modified lea
**Figure 6**. *SWAG posterior approximation and the loss surface for a ResNet-20 without skip-connections trained on CIFAR-10 in the subspace formed by the two largest eigenvalues of the SWAG covariance matrix. The shape of SWAG distribution is aligned with the posterior: the peaks of the two distributions coincide, and both distributions are wider in one direction than in the orthogonal direction. Visualization created in collaboration with* [Javier Ideami](https://losslandscape.com/).
-Empirically, SWAG performs on par or better than popular alternatives including MC dropout, KFAC Laplace, and temperature scaling on uncertainty quantification, out-of-distribution detection, calibration and transfer learning in computer vision tasks. Code for SWAG is available [here](https://github.com/wjmaddox/swa_gaussian).
+Empirically, SWAG performs on par or better than popular alternatives including MC dropout, KFAC Laplace, and temperature scaling on uncertainty quantification, out-of-distribution detection, calibration and transfer learning in computer vision tasks. Code for SWAG is available [here](https://github.com/wjmaddox/swa_gaussian).

@@ -214,7 +219,7 @@ Empirically, SWAG performs on par or better than popular alternatives including
MultiSWAG [9] uses multiple independent SWAG models to form a mixture of Gaussians as an approximate posterior distribution. Different basins of attraction contain highly complementary explanations of the data. Accordingly, marginalizing over these multiple basins provides a significant boost in accuracy and uncertainty representation. MultiSWAG can be viewed as a generalization of deep ensembles, but with performance improvements.
-Indeed, we see in Figure 8 that MultiSWAG entirely mitigates double descent -- more flexible models have monotonically improving performance -- and provides significantly improved generalization over SGD. For example, when the ResNet-18 has layers of width 20, Multi-SWAG achieves under 30% error whereas SGD achieves over 45%, more than a 15% gap!
+Indeed, we see in Figure 8 that MultiSWAG entirely mitigates double descent -- more flexible models have monotonically improving performance -- and provides significantly improved generalization over SGD. For example, when the ResNet-18 has layers of width 20, MultiSWAG achieves under 30% error whereas SGD achieves over 45%, more than a 15% gap!

@@ -227,18 +232,18 @@ Another [method](https://arxiv.org/abs/1907.07504), Subspace Inference, construc
## Try it Out!
-One of the greatest open questions in deep learning is why SGD manages to find good solutions, given that the training objectives are highly multimodal, and there are many settings of parameters that achieve no training loss but poor generalization. By understanding geometric features such as flatness, which relate to generalization, we can begin to resolve these questions and build optimizers that provide even better generalization, and many other useful features, such as uncertainty representation. We have presented SWA, a simple drop-in replacement for standard optimizers such as SGD and Adam, which can in principle, benefit anyone training a deep neural network. SWA has been demonstrated to have a strong performance in several areas, including computer vision, semi-supervised learning, reinforcement learning, uncertainty representation, calibration, Bayesian model averaging, and low precision training.
+One of the greatest open questions in deep learning is why SGD manages to find good solutions, given that the training objectives are highly multimodal and there are many settings of parameters that achieve no training loss but poor generalization. By understanding geometric features such as flatness, which relate to generalization, we can begin to resolve these questions and build optimizers that provide even better generalization, along with many other useful features, such as uncertainty representation. We have presented SWA, a simple drop-in replacement for standard optimizers such as SGD and Adam, which can, in principle, benefit anyone training a deep neural network. SWA has been demonstrated to have strong performance in several areas, including computer vision, semi-supervised learning, reinforcement learning, uncertainty representation, calibration, Bayesian model averaging, and low precision training.
-We encourage you to try out SWA! SWA is now as easy as any standard training in PyTorch. And even if you have already trained your model, you can use SWA to significantly improve performance by running it for a small number of epochs from a pre-trained model.
+We encourage you to try out SWA! Using SWA is now as easy as standard training in PyTorch. And even if you have already trained your model, you can use SWA to significantly improve performance by running it for a small number of epochs from a pre-trained model.
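
As a minimal sketch of the drop-in usage (the tiny model, synthetic loader, and hyperparameters are placeholders, not recommendations), training with the `torchcontrib` SWA wrapper looks roughly like this:

```python
import torch
from torchcontrib.optim import SWA

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(20)]

base_opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Start averaging after swa_start optimizer steps, then average every swa_freq steps at lr swa_lr.
opt = SWA(base_opt, swa_start=200, swa_freq=20, swa_lr=0.05)

for epoch in range(25):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

opt.swap_swa_sgd()  # copy the running SWA average into the model's parameters
# For models with BatchNorm, also run opt.bn_update(loader, model) before evaluation.
```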
[1] Averaging Weights Leads to Wider Optima and Better Generalization; Pavel Izmailov, Dmitry Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson; Uncertainty in Artificial Intelligence (UAI), 2018.
-[2] There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average; Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson;
+[2] There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average; Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson;
International Conference on Learning Representations (ICLR), 2019.
-[3] Improving Stability in Deep Reinforcement Learning with Weight Averaging; Evgenii Nikishin, Pavel Izmailov, Ben Athiwaratkun, Dmitrii Podoprikhin,
+[3] Improving Stability in Deep Reinforcement Learning with Weight Averaging; Evgenii Nikishin, Pavel Izmailov, Ben Athiwaratkun, Dmitrii Podoprikhin,
Timur Garipov, Pavel Shvechikov, Dmitry Vetrov, Andrew Gordon Wilson; UAI 2018 Workshop: Uncertainty in Deep Learning, 2018.
[4] A Simple Baseline for Bayesian Uncertainty in Deep Learning
@@ -249,7 +254,7 @@ Pavel Izmailov, Wesley Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov,
Uncertainty in Artificial Intelligence (UAI), 2019.
[6] SWALP : Stochastic Weight Averaging in Low Precision Training
-Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai,
+Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai,
Andrew Gordon Wilson, Christopher De Sa; International Conference on Machine Learning (ICML), 2019.
[7] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process; Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
@@ -257,7 +262,7 @@ Andrew Gordon Wilson, Christopher De Sa; International Conference on Machine Lea
[8] Acceleration of stochastic approximation by averaging. Boris T Polyak and Anatoli B Juditsky; SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[9] Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
-Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov,
+Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov,
Andrew Gordon Wilson. Neural Information Processing Systems (NeurIPS), 2018.
[10] Bayesian Deep Learning and a Probabilistic Perspective of Generalization
diff --git a/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md b/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md
index 2be782f18b47..dd1de7d70f8b 100644
--- a/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md
+++ b/_posts/2020-1-15-pytorch-1-dot-4-released-and-domain-libraries-updated.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'PyTorch 1.4 released, domain libraries updated'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'Today, we’re announcing the availability of PyTorch 1.4, along with updates to the PyTorch domain libraries. These releases build on top of the announcements from [NeurIPS 2019](https://pytorch.org/blog/pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community/), where we shared the availability of PyTorch Elastic, a new classification framework for image and video, and the addition of Preferred Networks to the PyTorch community. For those that attended the workshops at NeurIPS, the content can be found [here](https://research.fb.com/neurips-2019-expo-workshops/).'
---
Today, we’re announcing the availability of PyTorch 1.4, along with updates to the PyTorch domain libraries. These releases build on top of the announcements from [NeurIPS 2019](https://pytorch.org/blog/pytorch-adds-new-tools-and-libraries-welcomes-preferred-networks-to-its-community/), where we shared the availability of PyTorch Elastic, a new classification framework for image and video, and the addition of Preferred Networks to the PyTorch community. For those that attended the workshops at NeurIPS, the content can be found [here](https://research.fb.com/neurips-2019-expo-workshops/).
@@ -43,7 +46,7 @@ To learn more about the APIs and the design of this feature, see the links below
* [Distributed Autograd design doc](https://pytorch.org/docs/stable/notes/distributed_autograd.html)
* [Remote Reference design doc](https://pytorch.org/docs/stable/notes/rref.html)
-For the full tutorials, see the links below:
+For the full tutorials, see the links below:
* [A full RPC tutorial](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html)
* [Examples using model parallel training for reinforcement learning and with an LSTM](https://github.com/pytorch/examples/tree/master/distributed/rpc)
diff --git a/_posts/2020-3-26-introduction-to-quantization-on-pytorch.md b/_posts/2020-3-26-introduction-to-quantization-on-pytorch.md
index a23bdc353b4b..7dd77f23efd2 100644
--- a/_posts/2020-3-26-introduction-to-quantization-on-pytorch.md
+++ b/_posts/2020-3-26-introduction-to-quantization-on-pytorch.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'Introduction to Quantization on PyTorch'
author: Raghuraman Krishnamoorthi, James Reed, Min Ni, Chris Gottbrath, and Seth Weidman
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'It’s important to make efficient use of both server-side and on-device compute resources when developing machine learning applications. To support more efficient deployment on servers and edge devices, PyTorch added support for model quantization using the familiar eager mode Python API.'
---
It’s important to make efficient use of both server-side and on-device compute resources when developing machine learning applications. To support more efficient deployment on servers and edge devices, PyTorch added support for model quantization using the familiar eager mode Python API.
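
As a rough illustration (the toy model is a placeholder), eager-mode dynamic quantization can be applied to an existing float model in a couple of lines:

```python
import torch

float_model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)

# Convert the weights of the listed module types to int8; activations are
# quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
print(quantized_model(x).shape)  # same interface as the float model
```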
diff --git a/_posts/2020-4-21-pytorch-1-dot-5-released-with-new-and-updated-apis.md b/_posts/2020-4-21-pytorch-1-dot-5-released-with-new-and-updated-apis.md
index e81d2f7da780..1793dafd2b12 100644
--- a/_posts/2020-4-21-pytorch-1-dot-5-released-with-new-and-updated-apis.md
+++ b/_posts/2020-4-21-pytorch-1-dot-5-released-with-new-and-updated-apis.md
@@ -2,6 +2,9 @@
layout: blog_detail
title: 'PyTorch 1.5 released, new and updated APIs including C++ frontend API parity with Python'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [yellow]
+preview: 'Today, we’re announcing the availability of PyTorch 1.5, along with new and updated libraries. This release includes several major new API additions and improvements. PyTorch now includes a significant update to the C++ frontend, ‘channels last’ memory format for computer vision models, and a stable release of the distributed RPC framework used for model-parallel training. The release also has new APIs for autograd for hessians and jacobians, and an API that allows the creation of Custom C++ Classes that was inspired by pybind.'
---
diff --git a/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md b/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md
index 69101b8abc09..e6f875f8333d 100644
--- a/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md
+++ b/_posts/2020-4-21-pytorch-library-updates-new-model-serving-library.md
@@ -2,10 +2,13 @@
layout: blog_detail
title: 'PyTorch library updates including new model serving library '
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'Along with the PyTorch 1.5 release, we are announcing new libraries for high-performance PyTorch model serving and tight integration with TorchElastic and Kubernetes. Additionally, we are releasing updated packages for torch_xla (Google Cloud TPUs), torchaudio, torchvision, and torchtext. All of these new libraries and enhanced capabilities are available today and accompany all of the core features [released in PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis).'
---
-Along with the PyTorch 1.5 release, we are announcing new libraries for high-performance PyTorch model serving and tight integration with TorchElastic and Kubernetes. Additionally, we are releasing updated packages for torch_xla (Google Cloud TPUs), torchaudio, torchvision, and torchtext. All of these new libraries and enhanced capabilities are available today and accompany all of the core features [released in PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis).
+Along with the PyTorch 1.5 release, we are announcing new libraries for high-performance PyTorch model serving and tight integration with TorchElastic and Kubernetes. Additionally, we are releasing updated packages for torch_xla (Google Cloud TPUs), torchaudio, torchvision, and torchtext. All of these new libraries and enhanced capabilities are available today and accompany all of the core features [released in PyTorch 1.5](https://pytorch.org/blog/pytorch-1-dot-5-released-with-new-and-updated-apis).
## TorchServe (Experimental)
@@ -35,7 +38,7 @@ To learn more see the [TorchElastic repo](http://pytorch.org/elastic/0.2.0rc0/ku
## torch_xla 1.5 now available
-[torch_xla](http://pytorch.org/xla/) is a Python package that uses the [XLA linear algebra compiler](https://www.tensorflow.org/xla) to accelerate the [PyTorch deep learning framework](https://pytorch.org/) on [Cloud TPUs](https://cloud.google.com/tpu/) and [Cloud TPU Pods](https://cloud.google.com/tpu/docs/tutorials/pytorch-pod). torch_xla aims to give PyTorch users the ability to do everything they can do on GPUs on Cloud TPUs as well while minimizing changes to the user experience. The project began with a conversation at NeurIPS 2017 and gathered momentum in 2018 when teams from Facebook and Google came together to create a proof of concept. We announced this collaboration at PTDC 2018 and made the PyTorch/XLA integration broadly available at PTDC 2019. The project already has 28 contributors, nearly 2k commits, and a repo that has been forked more than 100 times.
+[torch_xla](http://pytorch.org/xla/) is a Python package that uses the [XLA linear algebra compiler](https://www.tensorflow.org/xla) to accelerate the [PyTorch deep learning framework](https://pytorch.org/) on [Cloud TPUs](https://cloud.google.com/tpu/) and [Cloud TPU Pods](https://cloud.google.com/tpu/docs/tutorials/pytorch-pod). torch_xla aims to let PyTorch users do everything they can do on GPUs on Cloud TPUs as well, while minimizing changes to the user experience. The project began with a conversation at NeurIPS 2017 and gathered momentum in 2018 when teams from Facebook and Google came together to create a proof of concept. We announced this collaboration at PTDC 2018 and made the PyTorch/XLA integration broadly available at PTDC 2019. The project already has 28 contributors, nearly 2k commits, and a repo that has been forked more than 100 times.
This release of [torch_xla](http://pytorch.org/xla/) is aligned and tested with PyTorch 1.5 to reduce friction for developers and to provide a stable and mature PyTorch/XLA stack for training models using Cloud TPU hardware. You can [try it for free](https://medium.com/pytorch/get-started-with-pytorch-cloud-tpus-and-colab-a24757b8f7fc) in your browser on an 8-core Cloud TPU device with [Google Colab](https://colab.research.google.com/), and you can use it at a much larger scale on [Google Cloud](https://cloud.google.com/gcp).
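
As a rough single-device sketch (the toy model and synthetic batches are placeholders), porting a training step to a Cloud TPU core mainly means swapping in the XLA device and optimizer-step helper:

```python
import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()                      # a single Cloud TPU core
model = torch.nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):
    x = torch.randn(8, 10, device=device)
    y = torch.randint(0, 2, (8,), device=device)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Steps the optimizer and forces execution of the lazily built XLA graph.
    xm.optimizer_step(optimizer, barrier=True)
```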
@@ -48,9 +51,9 @@ torchaudio, torchvision, and torchtext complement PyTorch with common datasets,
### torchaudio 0.5
The torchaudio 0.5 release includes new transforms, functionals, and datasets. Highlights for the release include:
-* Added the Griffin-Lim functional and transform, `InverseMelScale` and `Vol` transforms, and `DB_to_amplitude`.
+* Added the Griffin-Lim functional and transform, `InverseMelScale` and `Vol` transforms, and `DB_to_amplitude`.
* Added support for `allpass`, `fade`, `bandpass`, `bandreject`, `band`, `treble`, `deemph`, and `riaa` filters and transformations.
-* New datasets added including `LJSpeech` and `SpeechCommands` datasets.
+* Added new datasets, including `LJSpeech` and `SpeechCommands`.
See the full release notes [here](https://github.com/pytorch/audio/releases); the full docs can be found [here](https://pytorch.org/audio/).
@@ -58,7 +61,7 @@ See the release full notes [here](https://github.com/pytorch/audio/releases) and
The torchvision 0.6 release includes updates to datasets, models and a significant number of bug fixes. Highlights include:
* Faster R-CNN now supports negative samples which allows the feeding of images without annotations at training time.
-* Added `aligned` flag to `RoIAlign` to match Detectron2.
+* Added `aligned` flag to `RoIAlign` to match Detectron2.
* Refactored abstractions for C++ video decoder
See the full release notes [here](https://github.com/pytorch/vision/releases); the full docs can be found [here](https://pytorch.org/docs/stable/torchvision/index.html).
@@ -68,9 +71,9 @@ The torchtext 0.6 release includes a number of bug fixes and improvements to doc
* Fixed an issue related to the SentencePiece dependency in conda package.
* Added support for the experimental IMDB dataset to allow a custom vocab.
-* A number of documentation updates including adding a code of conduct and a deduplication of the docs on the torchtext site.
+* A number of documentation updates including adding a code of conduct and a deduplication of the docs on the torchtext site.
-Your feedback and discussions on the experimental datasets API are welcomed. You can send them to [issue #664](https://github.com/pytorch/text/issues/664). We would also like to highlight the pull request [here](https://github.com/pytorch/text/pull/701) where the latest dataset abstraction is applied to the text classification datasets. The feedback can be beneficial to finalizing this abstraction.
+We welcome your feedback and discussion on the experimental datasets API; you can send it to [issue #664](https://github.com/pytorch/text/issues/664). We would also like to highlight the pull request [here](https://github.com/pytorch/text/pull/701), which applies the latest dataset abstraction to the text classification datasets. Feedback there will help finalize this abstraction.
See the full release notes [here](https://github.com/pytorch/text/releases); the full docs can be found [here](https://pytorch.org/text/).
diff --git a/_posts/2020-5-5-updates-improvements-to-pytorch-tutorials.md b/_posts/2020-5-5-updates-improvements-to-pytorch-tutorials.md
index 1f0f8a9fc6d5..e5fbffbef2ed 100644
--- a/_posts/2020-5-5-updates-improvements-to-pytorch-tutorials.md
+++ b/_posts/2020-5-5-updates-improvements-to-pytorch-tutorials.md
@@ -2,14 +2,17 @@
layout: blog_detail
title: 'Updates & Improvements to PyTorch Tutorials'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'PyTorch.org provides researchers and developers with documentation, installation instructions, latest news, community projects, tutorials, and more. Today, we are introducing usability and content improvements including tutorials in additional categories, a new recipe format for quickly referencing common topics, sorting using tags, and an updated homepage.'
---
-PyTorch.org provides researchers and developers with documentation, installation instructions, latest news, community projects, tutorials, and more. Today, we are introducing usability and content improvements including tutorials in additional categories, a new recipe format for quickly referencing common topics, sorting using tags, and an updated homepage.
+PyTorch.org provides researchers and developers with documentation, installation instructions, latest news, community projects, tutorials, and more. Today, we are introducing usability and content improvements including tutorials in additional categories, a new recipe format for quickly referencing common topics, sorting using tags, and an updated homepage.
-Let’s take a look at them in detail.
+Let’s take a look at them in detail.
## TUTORIALS HOME PAGE UPDATE
-The tutorials home page now provides clear actions that developers can take. For new PyTorch users, there is an easy-to-discover button to take them directly to “A 60 Minute Blitz”. Right next to it, there is a button to view all recipes which are designed to teach specific features quickly with examples.
+The tutorials home page now provides clear actions that developers can take. For new PyTorch users, there is an easy-to-discover button to take them directly to “A 60 Minute Blitz”. Right next to it, there is a button to view all recipes, which are designed to teach specific features quickly with examples.

@@ -26,7 +29,7 @@ The following additional resources can also be found at the bottom of the Tutori
* [PyTorch Examples](https://github.com/pytorch/examples)
* [Tutorial on GitHub](https://github.com/pytorch/tutorials)
-## PYTORCH RECIPES
+## PYTORCH RECIPES
Recipes are new bite-sized, actionable examples designed to teach researchers and developers how to use specific PyTorch features. Some notable new recipes include:
* [Loading Data in PyTorch](https://pytorch.org/tutorials/recipes/recipes/loading_data_recipe.html)
* [Model Interpretability Using Captum](https://pytorch.org/tutorials/recipes/recipes/Captum_Recipe.html)
@@ -35,7 +38,7 @@ Recipes are new bite-sized, actionable examples designed to teach researchers an
View the full recipes [here](http://pytorch.org/tutorials/recipes/recipes_index.html).
## LEARNING PYTORCH
-This section includes tutorials designed for users new to PyTorch. Based on community feedback, we have made updates to the current [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) tutorial, one of our most popular tutorials for beginners. Upon completion, one can understand what PyTorch and neural networks are, and be able to build and train a simple image classification network. Updates include adding explanations to clarify output meanings and linking back to where users can read more in the docs, cleaning up confusing syntax errors, and reconstructing and explaining new concepts for easier readability.
+This section includes tutorials designed for users new to PyTorch. Based on community feedback, we have made updates to the current [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) tutorial, one of our most popular tutorials for beginners. Upon completion, one can understand what PyTorch and neural networks are, and be able to build and train a simple image classification network. Updates include adding explanations to clarify output meanings and linking back to where users can read more in the docs, cleaning up confusing syntax errors, and reconstructing and explaining new concepts for easier readability.
## DEPLOYING MODELS IN PRODUCTION
This section includes tutorials for developers looking to take their PyTorch models to production. The tutorials include:
@@ -45,7 +48,7 @@ This section includes tutorials for developers looking to take their PyTorch mod
* [Exploring a Model from PyTorch to ONNX and Running it using ONNX Runtime](https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html)
## FRONTEND APIS
-PyTorch provides a number of frontend API features that can help developers to code, debug, and validate their models more efficiently. This section includes tutorials that teach what these features are and how to use them. Some tutorials to highlight:
+PyTorch provides a number of frontend API features that can help developers to code, debug, and validate their models more efficiently. This section includes tutorials that teach what these features are and how to use them. Some tutorials to highlight:
* [Introduction to Named Tensors in PyTorch](https://pytorch.org/tutorials/intermediate/named_tensor_tutorial.html)
* [Using the PyTorch C++ Frontend](https://pytorch.org/tutorials/advanced/cpp_frontend.html)
* [Extending TorchScript with Custom C++ Operators](https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html)
@@ -59,7 +62,7 @@ Deep learning models often consume large amounts of memory, power, and compute d
* [Static Quantization with Eager Mode in PyTorch](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html)
## PARALLEL AND DISTRIBUTED TRAINING
-PyTorch provides features that can accelerate performance in research and production such as native support for asynchronous execution of collective operations and peer-to-peer communication that is accessible from Python and C++. This section includes tutorials on parallel and distributed training:
+PyTorch provides features that can accelerate performance in research and production, such as native support for asynchronous execution of collective operations and peer-to-peer communication that is accessible from Python and C++. This section includes tutorials on parallel and distributed training:
* [Single-Machine Model Parallel Best Practices](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
* [Getting started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
* [Getting started with Distributed RPC Framework](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html)
diff --git a/_posts/2020-7-28-pytorch-1.6-released.md b/_posts/2020-7-28-pytorch-1.6-released.md
index d1a18284dc69..1bf36e61e665 100644
--- a/_posts/2020-7-28-pytorch-1.6-released.md
+++ b/_posts/2020-7-28-pytorch-1.6-released.md
@@ -2,24 +2,27 @@
layout: blog_detail
title: 'PyTorch 1.6 released w/ Native AMP Support, Microsoft joins as maintainers for Windows'
author: Team PyTorch
+image: /assets/images/bert2.png
+tags: [five]
+preview: 'Today, we’re announcing the availability of PyTorch 1.6, along with updated domain libraries. We are also excited to announce the team at [Microsoft is now maintaining Windows builds and binaries](https://pytorch.org/blog/microsoft-becomes-maintainer-of-the-windows-version-of-pytorch) and will also be supporting the community on GitHub as well as the PyTorch Windows discussion forums.'
---
Today, we’re announcing the availability of PyTorch 1.6, along with updated domain libraries. We are also excited to announce the team at [Microsoft is now maintaining Windows builds and binaries](https://pytorch.org/blog/microsoft-becomes-maintainer-of-the-windows-version-of-pytorch) and will also be supporting the community on GitHub as well as the PyTorch Windows discussion forums.
-The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.
-A few of the highlights include:
+The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.
+A few of the highlights include:
-1. Automatic mixed precision (AMP) training is now natively supported and a stable feature (See [here](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) for more details) - thanks for NVIDIA’s contributions;
-2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
+1. Automatic mixed precision (AMP) training is now natively supported and a stable feature (see [here](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) for more details) - thanks to NVIDIA’s contributions;
+2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
3. Added support for complex tensors to the frontend API surface;
4. New profiling tools providing tensor-level memory consumption information;
5. Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.
-Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). You can also find the full release notes [here](https://github.com/pytorch/pytorch/releases).
+Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available by building from source, using nightlies, or via a compiler flag. You can learn more about what this change means in the post [here](https://pytorch.org/blog/pytorch-feature-classification-changes/). You can also find the full release notes [here](https://github.com/pytorch/pytorch/releases).
# Performance & Profiling
-## [Stable] Automatic Mixed Precision (AMP) Training
+## [Stable] Automatic Mixed Precision (AMP) Training
AMP allows users to easily enable automatic mixed precision training, delivering higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported `torch.cuda.amp` API, AMP provides convenience methods for mixed precision, where some operations use the `torch.float32 (float)` datatype and other operations use `torch.float16 (half)`. Some ops, like linear layers and convolutions, are much faster in `float16`. Other ops, like reductions, often require the dynamic range of `float32`. Mixed precision tries to match each op to its appropriate datatype.
@@ -27,7 +30,7 @@ AMP allows users to easily enable automatic mixed precision training enabling hi
* Documentation ([Link](https://pytorch.org/docs/stable/amp.html))
* Usage examples ([Link](https://pytorch.org/docs/stable/notes/amp_examples.html))
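
A minimal sketch of the usage pattern (the toy model and synthetic data are placeholders; a CUDA device is assumed):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 gradient underflow

for _ in range(10):
    x = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # ops inside run in float16 or float32 as appropriate
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```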
-## [Beta] Fork/Join Parallelism
+## [Beta] Fork/Join Parallelism
This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and unlocks the computational power of parallel architectures (e.g. many-core CPUs) for task-level parallelism.
@@ -48,10 +51,10 @@ def example(x):
print(example(torch.ones([])))
```
-
+
* Documentation ([Link](https://pytorch.org/docs/stable/jit.html))
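
A small self-contained sketch of the pattern, using only the public `torch.jit.fork`/`torch.jit.wait` APIs on a toy function:

```python
import torch

@torch.jit.script
def expensive(x: torch.Tensor) -> torch.Tensor:
    return torch.mm(x, x)

@torch.jit.script
def run_in_parallel(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    fut = torch.jit.fork(expensive, x)  # runs asynchronously, returns a Future
    out_y = expensive(y)                # executes concurrently with the forked task
    return torch.jit.wait(fut) + out_y  # block on the Future and combine results

print(run_in_parallel(torch.ones(2, 2), torch.ones(2, 2)))
```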
-## [Beta] Memory Profiler
+## [Beta] Memory Profiler
The `torch.autograd.profiler` API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.
@@ -83,7 +86,7 @@ print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
* PR ([Link](https://github.com/pytorch/pytorch/pull/37775))
* Documentation ([Link](https://pytorch.org/docs/stable/autograd.html#profiler))
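
A rough sketch of how this can be used (the workload is a placeholder); the resulting table can be sorted by the new memory columns:

```python
import torch
from torch.autograd import profiler

x = torch.randn(1024, 1024)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    y = x.mm(x)                        # memory allocated by each op is recorded
    z = torch.nn.functional.relu(y)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```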
-# Distributed Training & RPC
+# Distributed Training & RPC
## [Beta] TensorPipe backend for RPC
@@ -103,11 +106,11 @@ torch.distributed.rpc.rpc_sync(...)
* Design doc ([Link](https://github.com/pytorch/pytorch/issues/35251))
* Documentation ([Link](https://pytorch.org/docs/stable/rpc/index.html))
-## [Beta] DDP+RPC
+## [Beta] DDP+RPC
PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match these to try out hybrid parallelism paradigms.
-Starting in PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.
+Starting in PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. For example, users can place large embedding tables on parameter servers and use the RPC framework for embedding lookups, while storing smaller dense parameters on trainers and using DDP to synchronize them. Below is a simple code snippet.
```python
# On each trainer
@@ -139,11 +142,11 @@ def async_add_chained(to, x, y, z):
)
ret = rpc.rpc_sync(
- "worker1",
- async_add_chained,
+ "worker1",
+ async_add_chained,
args=("worker2", torch.ones(2), 1, 1)
)
-
+
print(ret) # prints tensor([3., 3.])
```
@@ -153,15 +156,15 @@ print(ret) # prints tensor([3., 3.])
# Frontend API Updates
-## [Beta] Complex Numbers
+## [Beta] Complex Numbers
-The PyTorch 1.6 release brings beta level support for complex tensors including torch.complex64 and torch.complex128 dtypes. A complex number is a number that can be expressed in the form a + bj, where a and b are real numbers, and j is a solution of the equation x^2 = −1. Complex numbers frequently occur in mathematics and engineering, especially in signal processing and the area of complex neural networks is an active area of research. The beta release of complex tensors will support common PyTorch and complex tensor functionality, plus functions needed by Torchaudio, ESPnet and others. While this is an early version of this feature, and we expect it to improve over time, the overall goal is provide a NumPy compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd to better support the scientific community.
+The PyTorch 1.6 release brings beta level support for complex tensors including the torch.complex64 and torch.complex128 dtypes. A complex number is a number that can be expressed in the form a + bj, where a and b are real numbers, and j is a solution of the equation x^2 = −1. Complex numbers frequently occur in mathematics and engineering, especially in signal processing, and complex neural networks are an active area of research. The beta release of complex tensors will support common PyTorch and complex tensor functionality, plus functions needed by Torchaudio, ESPnet and others. While this is an early version of this feature, and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd to better support the scientific community.
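
A minimal illustrative sketch; op coverage in the 1.6 beta is limited, so treat this as indicative rather than exhaustive:

```python
import torch

x = torch.tensor([1 + 2j, 3 - 1j], dtype=torch.complex64)
print(x.real, x.imag)                            # real and imaginary components
print(torch.view_as_real(x))                     # view as a (..., 2) float32 tensor
print(torch.view_as_complex(torch.randn(4, 2)))  # float pairs viewed as complex64
```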
# Updated Domain Libraries
-## torchvision 0.7
+## torchvision 0.7
-torchvision 0.7 introduces two new pretrained semantic segmentation models, [FCN ResNet50](https://arxiv.org/abs/1411.4038) and [DeepLabV3 ResNet50](https://arxiv.org/abs/1706.05587), both trained on COCO and using smaller memory footprints than the ResNet101 backbone. We also introduced support for AMP (Automatic Mixed Precision) autocasting for torchvision models and operators, which automatically selects the floating point precision for different GPU operations to improve performance while maintaining accuracy.
+torchvision 0.7 introduces two new pretrained semantic segmentation models, [FCN ResNet50](https://arxiv.org/abs/1411.4038) and [DeepLabV3 ResNet50](https://arxiv.org/abs/1706.05587), both trained on COCO and using smaller memory footprints than the ResNet101 backbone. We also introduced support for AMP (Automatic Mixed Precision) autocasting for torchvision models and operators, which automatically selects the floating point precision for different GPU operations to improve performance while maintaining accuracy.
* Release notes ([Link](https://github.com/pytorch/vision/releases))
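
A rough sketch of loading one of the new pretrained models (the random input stands in for a properly normalized image batch; weights download on first use):

```python
import torch
import torchvision

model = torchvision.models.segmentation.fcn_resnet50(pretrained=True).eval()

img = torch.randn(1, 3, 520, 520)     # placeholder for a normalized RGB batch
with torch.no_grad():
    out = model(img)["out"]           # per-pixel class scores, shape (N, 21, H, W)
print(out.shape)
```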
@@ -178,10 +181,10 @@ torchaudio now officially supports Windows. This release also introduces a new m
The Global PyTorch Summer Hackathon is back! This year, teams can compete in three categories virtually:
1. **PyTorch Developer Tools:** Tools or libraries designed to improve productivity and efficiency of PyTorch for researchers and developers
- 2. **Web/Mobile Applications powered by PyTorch:** Applications with web/mobile interfaces and/or embedded devices powered by PyTorch
+ 2. **Web/Mobile Applications powered by PyTorch:** Applications with web/mobile interfaces and/or embedded devices powered by PyTorch
3. **PyTorch Responsible AI Development Tools:** Tools, libraries, or web/mobile apps for responsible AI development
-This is a great opportunity to connect with the community and practice your machine learning skills.
+This is a great opportunity to connect with the community and practice your machine learning skills.
* [Join the hackathon](http://pytorch2020.devpost.com/)
* [Watch educational videos](https://www.youtube.com/pytorch)
@@ -189,11 +192,11 @@ This is a great opportunity to connect with the community and practice your mach
## LPCV Challenge
-The [2020 CVPR Low-Power Vision Challenge (LPCV) - Online Track for UAV video](https://lpcv.ai/2020CVPR/video-track) submission deadline is coming up shortly. You have until July 31, 2020 to build a system that can discover and recognize characters in video captured by an unmanned aerial vehicle (UAV) accurately using PyTorch and Raspberry Pi 3B+.
+The [2020 CVPR Low-Power Vision Challenge (LPCV) - Online Track for UAV video](https://lpcv.ai/2020CVPR/video-track) submission deadline is coming up shortly. You have until July 31, 2020 to build a system that can accurately discover and recognize characters in video captured by an unmanned aerial vehicle (UAV), using PyTorch and a Raspberry Pi 3B+.
## Prototype Features
-To reiterate, Prototype features in PyTorch are early features that we are looking to gather feedback on, gauge the usefulness of and improve ahead of graduating them to Beta or Stable. The following features are not part of the PyTorch 1.6 release and instead are available in nightlies with separate docs/tutorials to help facilitate early usage and feedback.
+To reiterate, Prototype features in PyTorch are early features that we are looking to gather feedback on, gauge the usefulness of, and improve ahead of graduating them to Beta or Stable. The following features are not part of the PyTorch 1.6 release and instead are available in nightlies with separate docs/tutorials to help facilitate early usage and feedback.
#### Distributed RPC/Profiler
Allow users to profile training jobs that use `torch.distributed.rpc` using the autograd profiler, and remotely invoke the profiler in order to collect profiling information across different nodes. The RFC can be found [here](https://github.com/pytorch/pytorch/issues/39675) and a short recipe on how to use this feature can be found [here](https://github.com/pytorch/tutorials/tree/master/prototype_source).
diff --git a/_sass/blog.scss b/_sass/blog.scss
index 7e898e502bf0..7d439dc16cd1 100644
--- a/_sass/blog.scss
+++ b/_sass/blog.scss
@@ -62,7 +62,7 @@
}
@include desktop {
margin-top: 380px + $desktop_header_height;
- .row.blog-index
+ /*.row.blog-index
[class*="col-"]:not(:first-child):not(:last-child):not(:nth-child(3n)) {
padding-right: rem(35px);
padding-left: rem(35px);
@@ -74,7 +74,7 @@
.row.blog-index [class*="col-"]:nth-child(3n + 1) {
padding-right: rem(35px);
- }
+ }*/
.col-md-4 {
margin-bottom: rem(23px);
@@ -139,7 +139,7 @@
overflow: unset;
white-space: unset;
text-overflow: unset;
- }
+ }
}
h1 {
@@ -221,7 +221,7 @@
}
}
- .page-link {
+ .page-link, .all-blogs {
font-size: rem(20px);
letter-spacing: 0;
line-height: rem(34px);
@@ -230,6 +230,37 @@
text-align: center;
}
+ .all-blogs {
+ width: inherit;
+ padding: 0.5rem 3.75rem;
+ color: $dark_grey;
+ &:hover {
+ color: $orange;
+ }
+ }
+
+ .dropdown {
+ margin-bottom: 3rem;
+ }
+
+ #dropdownMenuButton {
+ cursor: pointer;
+ position: absolute;
+ right: 0;
+ bottom: 1rem;
+ z-index: 1;
+ top: inherit;
+ max-width: 4rem;
+ border: none;
+ background: inherit;
+ padding: inherit;
+ }
+
+ .dropdown-item:hover {
+ color: $orange;
+ cursor: pointer;
+ }
+
@media (max-width: 1067px) {
.jumbotron {
h1 {
@@ -271,3 +302,17 @@ twitterwidget {
margin-bottom: rem(18px) !important;
}
+.blog .pagination {
+ .page {
+ border: 1px solid #dee2e6;
+ padding: 0.5rem 0.75rem;
+ }
+
+ .active .page {
+ background-color: #dee2e6;
+ }
+}
+
+.blog .blog-img {
+ border: 1px solid $dark_grey;
+}
diff --git a/assets/filter-hub-tags.js b/assets/filter-hub-tags.js
index 65e59f0339d0..dfb5cd80e99f 100644
--- a/assets/filter-hub-tags.js
+++ b/assets/filter-hub-tags.js
@@ -2,11 +2,21 @@ var filterScript = $("script[src*=filter-hub-tags]");
var listId = filterScript.attr("list-id");
var displayCount = Number(filterScript.attr("display-count"));
var pagination = filterScript.attr("pagination");
+var options;
+
+if (listId == "all-blog-posts") {
+ options = {
+ valueNames: [{ data: ["tags"] }],
+ page: displayCount
+ };
+}
+else {
+ options = {
+ valueNames: ["github-stars-count-whole-number", { data: ["tags", "date-added", "title"] }],
+ page: displayCount
+ };
+}
-var options = {
- valueNames: ["github-stars-count-whole-number", { data: ["tags", "date-added", "title"] }],
- page: displayCount
-};
$(".next-news-item").on("click" , function(){
$(".pagination").find(".active").next().trigger( "click" );
@@ -101,3 +111,19 @@ $("#sortTitleLow").on("click", function() {
$("#sortTitleHigh").on("click", function() {
hubList.sort("title", { order: "asc" });
});
+
+// Filter the blog posts based on the selected tag
+
+$(".blog-filter-btn").on("click", function() {
+ filterBlogPosts($(this).data("tag"));
+});
+
+function filterBlogPosts(tag) {
+ hubList.filter(function (item) {
+ if (item.values().tags == tag) {
+ return true;
+ } else {
+ return false;
+ }
+ });
+}
diff --git a/blog.html b/blog.html
deleted file mode 100644
index e39a2a2a555c..000000000000
--- a/blog.html
+++ /dev/null
@@ -1,10 +0,0 @@
----
-layout: blog
-title: Blog
-permalink: /blog/
-body-class: blog
-redirect_from: "/blog/categories/"
-pagination:
- enabled: true
- permalink: /:num/
----
diff --git a/blog/all-posts.html b/blog/all-posts.html
new file mode 100644
index 000000000000..a9b442c493e4
--- /dev/null
+++ b/blog/all-posts.html
@@ -0,0 +1,50 @@
+---
+layout: blog
+title: Blog
+permalink: /blog/all-posts
+body-class: blog
+---
+
+{% assign posts = site.posts %}
+
+{% include blog_jumbotron.html posts=posts %}
+
+{% assign all_tags = posts | map: "tags" | join: ',' | split: ',' | uniq | sort %}
+
+{% for post in posts %}
+  {{ post.preview | truncate: 150 }}
+  {{ post.date | date: '%B %d, %Y' }}
+{% endfor %}
diff --git a/blog/landing-page.html b/blog/landing-page.html
new file mode 100644
index 000000000000..c166f4c17ad4
--- /dev/null
+++ b/blog/landing-page.html
@@ -0,0 +1,36 @@
+---
+layout: blog
+title: Blog
+permalink: /blog/
+body-class: blog
+redirect_from: "/blog/categories/"
+pagination:
+ enabled: true
+ permalink: /:num/
+---
+
+{% assign posts = paginator.posts %}
+
+{% include blog_jumbotron.html posts=posts %}
+
+{% for post in posts %}
+  {{ post.date | date: '%B %d, %Y' }}
+  {{ post.excerpt | remove: '<p>' | remove: '</p>' | truncate: 500 | strip_html }}
+{% endfor %}