_posts/2020-08-08-pytorch-1.6-now-includes-stochastic-weight-averaging.md
Lines changed: 7 additions & 16 deletions
@@ -8,12 +8,12 @@ Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedur
SWA has a wide range of applications and features:
-*1 SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
-*2 SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
-*3 SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
-*4 SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
-*5 SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
-*6 SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and ,in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].
+* SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
+* SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
+* SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
+* SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
+* SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
+* SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and, in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].
<div class="text-center">
@@ -24,7 +24,7 @@ SWA has a wide range of applications and features:
In short, SWA performs an equal average of the weights traversed by SGD (or any stochastic optimizer) with a modified learning rate schedule (see the left panel of Figure 1.). SWA solutions end up in the center of a wide flat region of loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of Figure 1). We emphasize that SWA **can be used with any optimizer, such as Adam, and is not specific to SGD**.
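As a brief sketch of the arithmetic behind "equal average" (not taken from the post itself, and assuming one weight snapshot is collected per epoch after the schedule switch), the SWA solution over $k$ snapshots is

$$w_{\text{SWA}} = \frac{1}{k}\sum_{i=1}^{k} w_i,$$

which can be maintained incrementally, so only one extra copy of the weights needs to be stored:

$$\bar{w}_{n+1} = \frac{n\,\bar{w}_n + w_{n+1}}{n+1}.$$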
-In PyTorch 1.6 we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging)
+Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging).
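Since this part of the diff only links to the API, here is a minimal sketch of the training loop that `torch.optim.swa_utils` enables. The toy model, synthetic loader, and the `swa_start = 75` cutoff are placeholders chosen for illustration; `AveragedModel`, `SWALR`, and `update_bn` are the utilities the post refers to.

```python
import torch
from torch import nn, optim
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Placeholder model and data; substitute your own network and DataLoader.
model = nn.Linear(10, 1)
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(20)]
loss_fn = nn.MSELoss()

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

swa_model = AveragedModel(model)               # maintains the running equal average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # anneals to, then holds, the SWA learning rate
swa_start = 75                                 # epoch at which averaging begins (placeholder)

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into the equal average
        swa_scheduler.step()
    else:
        scheduler.step()

# Recompute batch-norm statistics for the averaged model before evaluation
# (a no-op here, since the toy model has no batch-norm layers).
update_bn(loader, swa_model)
```

At test time, predictions come from `swa_model` rather than `model`, which is why the batch-norm statistics must be recomputed for the averaged weights.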
## Is this just Averaged SGD?
@@ -313,12 +313,3 @@ Gupta, Vipul, Santiago Akle Serrano, and Dennis DeCoste; International Conferenc
[12] SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks
Shin, Sungho, Yoonho Boo, and Wonyong Sung; arXiv preprint 2020.