_posts/2022-1-19-quantization-in-practice.md (+5 −3 lines)
@@ -5,7 +5,7 @@ author: Suraj Subramanian, Jerry Zhang
featured-img: ''
---
-There are a few different ways to quantize your model with PyTorch. In this blog post, we'll take a look at how each technique looks like in practice. I will use a non-standard model that is not traceable, to paint an accurate picture of how much effort is really needed when quantizing your model.
+Quantization is a cheap and easy way to make your DNN run faster and with lower memory requirements. PyTorch offers a few different approaches to quantize your model. In this blog post, we'll lay a (quick) foundation of quantization in deep learning, then take a look at what each technique looks like in practice. Finally, we'll end with recommendations from the literature for using quantization in your workflows.
@@ -23,7 +23,7 @@ Overparameterized DNNs have more degrees of freedom and this makes them good can
At the heart of it all is a **mapping function**, a linear projection from floating-point to integer space: <img src="https://latex.codecogs.com/gif.latex?Q(r) = round(r/S + Z)">
-To reconvert to floating point space, the inverse function is given by <img src="https://latex.codecogs.com/gif.latex?math=\tilde r = (Q(r) - Z) \cdot S">.
+To reconvert to floating point space, the inverse function is given by <img src="https://latex.codecogs.com/gif.latex?\tilde r = (Q(r) - Z) \cdot S">.
<img src="https://latex.codecogs.com/gif.latex?\tilde r \neq r">, and their difference constitutes the *quantization error*.
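To make the round trip concrete, here is a minimal plain-Python sketch of affine quantization and dequantization (illustrative only, not PyTorch's implementation; the qparams `S` and `Z` below are assumed example values for mapping roughly [-1.0, 1.0] onto an unsigned 8-bit range):

```python
def quantize(r, S, Z, qmin=0, qmax=255):
    # Q(r) = round(r / S + Z), clamped to the quantized dtype's range
    return max(qmin, min(qmax, round(r / S + Z)))

def dequantize(q, S, Z):
    # r~ = (q - Z) * S only approximates the original r
    return (q - Z) * S

S = 2.0 / 255   # example scale
Z = 128         # example zero-point
r = 0.3
q = quantize(r, S, Z)           # 166
r_tilde = dequantize(q, S, Z)   # ~0.2980
error = abs(r - r_tilde)        # quantization error; bounded by S/2 for in-range inputs
```

Values outside the representable range are clamped to `qmin`/`qmax`, so their error can be much larger than `S/2` — which is why the choice of range matters, as the next section discusses.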
@@ -36,7 +36,9 @@ The zero-point <img src="https://latex.codecogs.com/gif.latex?Z"> acts as a bias
### Calibration
-The process of choosing the input range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running mininmum and maximum values and assign them to <img src="https://latex.codecogs.com/gif.latex?\alpha"> and <img src="https://latex.codecogs.com/gif.latex?\beta">. [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range. In PyTorch, `Observer` modules ([docs](https://PyTorch.org/docs/stable/torch.quantization.html?highlight=observer#observers), [code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams <img src="https://latex.codecogs.com/gif.latex?S, Z">. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later).
+The process of choosing the input range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to <img src="https://latex.codecogs.com/gif.latex?\alpha"> and <img src="https://latex.codecogs.com/gif.latex?\beta">. [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.
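As a toy sketch of the min/max scheme (plain Python, not PyTorch's `Observer` code; it assumes an affine uint8 scheme and ignores edge cases such as a constant input or clamping the zero-point):

```python
def minmax_qparams(observed, qmin=0, qmax=255):
    # The observed range [alpha, beta] becomes the input range to quantize
    alpha, beta = min(observed), max(observed)
    # Scale stretches [alpha, beta] over the integer range;
    # the zero-point shifts it so that alpha maps to qmin
    S = (beta - alpha) / (qmax - qmin)
    Z = qmin - round(alpha / S)
    return S, Z

# For values spanning [-1.0, 1.0] this yields S = 2/255 and Z = 128
S, Z = minmax_qparams([-1.0, -0.25, 0.4, 1.0])
```

Min/max is cheap but sensitive to outliers: a single extreme value inflates the range and wastes quantized levels, which is what the entropy- and percentile-based schemes above try to avoid.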
+
+In PyTorch, `Observer` modules ([docs](https://PyTorch.org/docs/stable/torch.quantization.html?highlight=observer#observers), [code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams <img src="https://latex.codecogs.com/gif.latex?S, Z">. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later).
```python
from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver