Fig 1. PyTorch <3 Quantization
</p>

**Contents**
* TOC
{:toc}

## Fundamentals of Quantization

> If someone asks you what time it is, you don't respond "10:14:34:430705", but you might say "a quarter past 10".

Quantization has roots in information compression; in deep networks, it refers to reducing the numerical precision of their weights and/or activations.

Overparameterized DNNs have more degrees of freedom, which makes them good candidates for information compression [[1]]. When you quantize a model, two things generally happen - the model gets smaller and runs more efficiently. Hardware vendors explicitly support faster processing of 8-bit data (compared to 32-bit data), resulting in higher throughput. A smaller model also has a lower memory footprint and power consumption [[2]], crucial for deployment at the edge.

### Mapping function

The mapping function is what you might guess - a function that maps values from floating-point to integer space. A commonly used mapping function is a linear transformation given by <img src="https://latex.codecogs.com/gif.latex?Q(r) = round(r/S + Z)">, where <img src="https://latex.codecogs.com/gif.latex?r"> is the input and <img src="https://latex.codecogs.com/gif.latex?S, Z"> are **quantization parameters**.

To reconvert to floating point space, the inverse function is given by <img src="https://latex.codecogs.com/gif.latex?\tilde r = (Q(r) - Z) \cdot S">.

<img src="https://latex.codecogs.com/gif.latex?\tilde r \neq r">, and their difference constitutes the *quantization error*.
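
To make the round trip concrete, here is a minimal sketch (hypothetical helper functions, not PyTorch's implementation) of the mapping function, its inverse, and the resulting quantization error, assuming example values for `S` and `Z`:

```python
import torch

def quantize(r, S, Z, alpha_q=-128, beta_q=127):
    # Q(r) = round(r/S + Z), clamped to the quantized output range [alpha_q, beta_q]
    return torch.clamp(torch.round(r / S + Z), alpha_q, beta_q)

def dequantize(q, S, Z):
    # r_tilde = (Q(r) - Z) * S
    return (q - Z) * S

r = torch.randn(5)                # example fp32 values
S, Z = 0.05, 0                    # example qparams; normally derived from calibration
q = quantize(r, S, Z)
r_tilde = dequantize(q, S, Z)
print(r - r_tilde)                # the quantization error
```
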
### Quantization Parameters

The mapping function is parameterized by the **scaling factor** <img src="https://latex.codecogs.com/gif.latex?S"> and **zero-point** <img src="https://latex.codecogs.com/gif.latex?Z">.

<img src="https://latex.codecogs.com/gif.latex?S"> is simply the ratio of the input range to the output range, <img src="https://latex.codecogs.com/gif.latex?S = \frac{\beta - \alpha}{\beta_q - \alpha_q}">, where [<img src="https://latex.codecogs.com/gif.latex?\alpha, \beta">] is the clipping range of the input, i.e. the boundaries of permissible inputs, and [<img src="https://latex.codecogs.com/gif.latex?\alpha_q, \beta_q">] is the range in quantized output space that it is mapped to. For 8-bit quantization, the output range <img src="https://latex.codecogs.com/gif.latex?\beta_q - \alpha_q \leq (2^8 - 1)">.

<img src="https://latex.codecogs.com/gif.latex?Z"> acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space: <img src="https://latex.codecogs.com/gif.latex?Z = -(\frac{\alpha}{S} - \alpha_q)">.
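
As a rough sketch (a hypothetical helper, not a PyTorch API), the qparams for a signed 8-bit range could be derived from a clipping range like this:

```python
def get_qparams(alpha, beta, alpha_q=-128, beta_q=127):
    # S = (beta - alpha) / (beta_q - alpha_q)
    S = (beta - alpha) / (beta_q - alpha_q)
    # Z = -(alpha / S - alpha_q), rounded so that fp32 zero maps to an integer
    Z = int(round(-(alpha / S - alpha_q)))
    return S, Z

S, Z = get_qparams(alpha=-1.0, beta=1.0)   # e.g. inputs clipped to [-1, 1]
print(S, Z)                                # ~0.0078, 0
```
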
### Calibration

The process of choosing the input clipping range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to <img src="https://latex.codecogs.com/gif.latex?\alpha"> and <img src="https://latex.codecogs.com/gif.latex?\beta">. [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.

In PyTorch, `Observer` modules ([docs](https://PyTorch.org/docs/stable/torch.quantization.html?highlight=observer#observers), [code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams <img src="https://latex.codecogs.com/gif.latex?S, Z">. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later).
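
For instance, a quick sketch of running a couple of sample tensors through a few of the observer classes from the linked docs and comparing the qparams each one computes:

```python
import torch
from torch.quantization.observer import (
    MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver)

inputs = [torch.randn(3, 4), torch.randn(3, 4)]   # stand-in calibration data

for obs in [MinMaxObserver(), MovingAverageMinMaxObserver(), HistogramObserver()]:
    for x in inputs:
        obs(x)                                    # observers record running statistics on the inputs
    print(obs.__class__.__name__, obs.calculate_qparams())   # -> (scale, zero_point)
```
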
Currently, quantized operators run on x86 machines via the [FBGEMM backend](https://github.com/pytorch/FBGEMM), or use [QNNPACK](https://github.com/pytorch/QNNPACK) primitives on ARM machines. Backend support for server GPUs (via TensorRT and cuDNN) is coming soon. Learn more about extending quantization to custom backends: [RFC-0019](https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md).
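
As a small sketch, picking the backend (and the matching default qconfig) on an x86 machine might look like this:

```python
import torch

backend = 'fbgemm'                                          # use 'qnnpack' on ARM
torch.backends.quantized.engine = backend                   # select the kernel backend
qconfig = torch.quantization.get_default_qconfig(backend)   # observer settings matched to the backend
print(qconfig)
```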