Commit 49ff5cb

Update 2020-08-08-pytorch-1.6-now-includes-stochastic-weight-averaging.md
1 parent 2de0b5e commit 49ff5cb

File tree: 1 file changed (+7, -16 lines)


_posts/2020-08-08-pytorch-1.6-now-includes-stochastic-weight-averaging.md

Lines changed: 7 additions & 16 deletions
@@ -8,12 +8,12 @@ Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedur


 SWA has a wide range of applications and features:
-* 1 SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
-* 2 SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
-* 3 SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
-* 4 SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
-* 5 SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
-* 6 SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and ,in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].
+* SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
+* SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
+* SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
+* SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
+* SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
+* SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and, in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].


 <div class="text-center">
@@ -24,7 +24,7 @@ SWA has a wide range of applications and features:

 In short, SWA performs an equal average of the weights traversed by SGD (or any stochastic optimizer) with a modified learning rate schedule (see the left panel of Figure 1.). SWA solutions end up in the center of a wide flat region of loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of Figure 1). We emphasize that SWA **can be used with any optimizer, such as Adam, and is not specific to SGD**.

-In PyTorch 1.6 we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging)
+Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging).

 ## Is this just Averaged SGD?

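As context for the updated line above, a minimal end-to-end sketch of the torch.optim.swa_utils API it links to might look as follows. The toy model, the synthetic loader, and the swa_start / swa_lr / epoch values are illustrative assumptions, not settings from the post.

```python
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Illustrative stand-ins: any model, loss, and data loader could be used here.
model = torch.nn.Linear(10, 2)
loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(8)]
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

swa_model = AveragedModel(model)               # maintains the running equal average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # modified learning rate schedule for the SWA phase

epochs, swa_start = 10, 5                      # assumed schedule: average over the last 5 epochs
for epoch in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold the current weights into the running average
        swa_scheduler.step()                   # keep the learning rate on the SWA schedule

# Recompute BatchNorm statistics for the averaged weights (a no-op for this toy model).
update_bn(loader, swa_model)
```

After training, swa_model (not model) holds the averaged weights to evaluate; update_bn is needed because BatchNorm statistics are accumulated for the weights visited during training rather than for their average.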

@@ -313,12 +313,3 @@ Gupta, Vipul, Santiago Akle Serrano, and Dennis DeCoste; International Conferenc
 [12] SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks
 Shin, Sungho, Yoonho Boo, and Wonyong Sung; arXiv preprint 2020.

-
-
-
-
-
-
-
-
-
