_posts/2020-08-08-pytorch-1.6-now-includes-stochastic-weight-averaging.md
Lines changed: 7 additions & 16 deletions
@@ -8,12 +8,12 @@ Do you use stochastic gradient descent (SGD) or Adam? Regardless of the procedur
SWA has a wide range of applications and features:
-*1 SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
-*2 SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
-*3 SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
-*4 SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
-*5 SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
-*6 SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and ,in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].
+* SWA significantly improves performance compared to standard training techniques in computer vision (e.g., VGG, ResNets, Wide ResNets and DenseNets on ImageNet and CIFAR benchmarks [1, 2])
+* SWA provides state-of-the-art performance on key benchmarks in semi-supervised learning and domain adaptation [2].
+* SWA was shown to improve performance in language modeling (e.g., AWD-LSTM on WikiText-2 [4]) and policy-gradient methods in deep reinforcement learning [3].
+* SWAG, an extension of SWA, can approximate Bayesian model averaging in Bayesian deep learning and achieves state-of-the-art uncertainty calibration results in various settings. Moreover, its recent generalization MultiSWAG provides significant additional performance gains and mitigates double-descent [4, 10]. Another approach, Subspace Inference, approximates the Bayesian posterior in a small subspace of the parameter space around the SWA solution [5].
+* SWA for low precision training, SWALP, can match the performance of full-precision SGD training, even with all numbers quantized down to 8 bits, including gradient accumulators [6].
+* SWA in parallel, SWAP, was shown to greatly speed up the training of neural networks by using large batch sizes and, in particular, set a record by training a neural network to 94% accuracy on CIFAR-10 in 27 seconds [11].
<div class="text-center">
@@ -24,7 +24,7 @@ SWA has a wide range of applications and features:
In short, SWA performs an equal average of the weights traversed by SGD (or any stochastic optimizer) with a modified learning rate schedule (see the left panel of Figure 1.). SWA solutions end up in the center of a wide flat region of loss, while SGD tends to converge to the boundary of the low-loss region, making it susceptible to the shift between train and test error surfaces (see the middle and right panels of Figure 1). We emphasize that SWA **can be used with any optimizer, such as Adam, and is not specific to SGD**.
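As a brief sketch of the arithmetic behind "equal average" (not taken from the post itself, and assuming one weight snapshot is collected per epoch after the schedule switch), the SWA solution over $k$ snapshots is

$$w_{\text{SWA}} = \frac{1}{k}\sum_{i=1}^{k} w_i,$$

which can be maintained incrementally, so only one extra copy of the weights needs to be stored:

$$\bar{w}_{n+1} = \frac{n\,\bar{w}_n + w_{n+1}}{n+1}.$$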
-In PyTorch 1.6 we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging)
+Previously, SWA was in PyTorch contrib. In PyTorch 1.6, we provide a new convenient implementation of SWA in [torch.optim.swa_utils](https://pytorch.org/docs/stable/optim.html#stochastic-weight-averaging).
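Since this part of the diff only links to the API, here is a minimal sketch of the training loop that `torch.optim.swa_utils` enables. The toy model, synthetic loader, and the `swa_start = 75` cutoff are placeholders chosen for illustration; `AveragedModel`, `SWALR`, and `update_bn` are the utilities the post refers to.

```python
import torch
from torch import nn, optim
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Placeholder model and data; substitute your own network and DataLoader.
model = nn.Linear(10, 1)
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(20)]
loss_fn = nn.MSELoss()

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

swa_model = AveragedModel(model)               # maintains the running equal average of weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # anneals to, then holds, the SWA learning rate
swa_start = 75                                 # epoch at which averaging begins (placeholder)

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into the equal average
        swa_scheduler.step()
    else:
        scheduler.step()

# Recompute batch-norm statistics for the averaged model before evaluation
# (a no-op here, since the toy model has no batch-norm layers).
update_bn(loader, swa_model)
```

At test time, predictions come from `swa_model` rather than `model`, which is why the batch-norm statistics must be recomputed for the averaged weights.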
## Is this just Averaged SGD?
@@ -313,12 +313,3 @@ Gupta, Vipul, Santiago Akle Serrano, and Dennis DeCoste; International Conferenc
[12] SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks
Shin, Sungho, Yoonho Boo, and Wonyong Sung; arXiv preprint 2020.