
Commit a7762fc

add toc
1 parent 18be151 commit a7762fc


_posts/2022-1-19-quantization-in-practice.md

Lines changed: 17 additions & 10 deletions
@@ -13,30 +13,39 @@ Quantization is a cheap and easy way to make your DNN run faster and with lower
 Fig 1. PyTorch <3 Quantization
 </p>
 
-## A quick introduction to quantization
+**Contents**
+* TOC
+{:toc}
+## Fundamentals of Quantization
 
 > If someone asks you what time it is, you don't respond "10:14:34:430705", but you might say "a quarter past 10".
 
 Quantization has roots in information compression; in deep networks it refers to reducing the numerical precision of their weights and/or activations.
 
 Overparameterized DNNs have more degrees of freedom and this makes them good candidates for information compression [[1]]. When you quantize a model, two things generally happen - the model gets smaller and runs with better efficiency. Hardware vendors explicitly allow for faster processing of 8-bit data (than 32-bit data) resulting in higher throughput. A smaller model has lower memory footprint and power consumption [[2]], crucial for deployment at the edge.
 
-At the heart of it all is a **mapping function**, a linear projection from floating-point to integer space: <img src="https://latex.codecogs.com/gif.latex?Q(r) = round(r/S + Z)">
+### Mapping function
+The mapping function is what you might guess - a function that maps values from floating-point to integer space. A commonly used mapping function is a linear transformation given by <img src="https://latex.codecogs.com/gif.latex?Q(r) = round(r/S + Z)">, where <img src="https://latex.codecogs.com/gif.latex?r"> is the input and <img src="https://latex.codecogs.com/gif.latex?S, Z"> are **quantization parameters**.
 
 To reconvert to floating point space, the inverse function is given by <img src="https://latex.codecogs.com/gif.latex?\tilde r = (Q(r) - Z) \cdot S">.
 
 <img src="https://latex.codecogs.com/gif.latex?\tilde r \neq r">, and their difference constitutes the *quantization error*.
 
-The scaling factor <img src="https://latex.codecogs.com/gif.latex?S"> is simply the ratio of the input range to the output range: <img src="https://latex.codecogs.com/gif.latex?S = \frac{\beta - \alpha}{\beta_q - \alpha_q}">
+### Quantization Parameters
+The mapping function is parameterized by the **scaling factor** <img src="https://latex.codecogs.com/gif.latex?S"> and **zero-point** <img src="https://latex.codecogs.com/gif.latex?Z">.
+
+<img src="https://latex.codecogs.com/gif.latex?S"> is simply the ratio of the input range to the output range
+<img src="https://latex.codecogs.com/gif.latex?S = \frac{\beta - \alpha}{\beta_q - \alpha_q}">
+
 where [<img src="https://latex.codecogs.com/gif.latex?\alpha, \beta">] is the clipping range of the input, i.e. the boundaries of permissible inputs. [<img src="https://latex.codecogs.com/gif.latex?\alpha_q, \beta_q">] is the range in quantized output space that it is mapped to. For 8-bit quantization, the output range <img src="https://latex.codecogs.com/gif.latex?\beta_q - \alpha_q <= (2^8 - 1)">.
 
-The zero-point <img src="https://latex.codecogs.com/gif.latex?Z"> acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space. <img src="https://latex.codecogs.com/gif.latex?Z = -(\frac{\alpha}{S} - \alpha_q)">
 
-<img src="https://latex.codecogs.com/gif.latex?S, Z"> can be calculated and used for quantizing an entire tensor ("per-tensor"), or individually for each channel ("per-channel").
+<img src="https://latex.codecogs.com/gif.latex?Z"> acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space. <img src="https://latex.codecogs.com/gif.latex?Z = -(\frac{\alpha}{S} - \alpha_q)">
+
 
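For instance, plugging an observed clipping range into these two formulas gives the qparams directly; a small sketch (pure Python, with an invented activation range and the default unsigned 8-bit output range):

```python
def qparams_from_range(alpha, beta, alpha_q=0, beta_q=255):
    """Example computation of (S, Z) for a clipping range [alpha, beta]
    mapped onto the integer range [alpha_q, beta_q] (unsigned 8-bit by default)."""
    S = (beta - alpha) / (beta_q - alpha_q)  # scale = input range / output range
    Z = round(-(alpha / S - alpha_q))        # zero-point, rounded to an integer
    return S, Z

# e.g. activations observed to lie in [-2.0, 6.0]
S, Z = qparams_from_range(-2.0, 6.0)
print(S, Z)  # ~0.0314, 64
```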

 ### Calibration
-The process of choosing the input range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running mininmum and maximum values and assign them to <img src="https://latex.codecogs.com/gif.latex?\alpha"> and <img src="https://latex.codecogs.com/gif.latex?\beta">. [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.
+The process of choosing the input clipping range is known as **calibration**. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to <img src="https://latex.codecogs.com/gif.latex?\alpha"> and <img src="https://latex.codecogs.com/gif.latex?\beta">. [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/calib.html) also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.
 
 In PyTorch, `Observer` modules ([docs](https://PyTorch.org/docs/stable/torch.quantization.html?highlight=observer#observers), [code](https://github.com/PyTorch/PyTorch/blob/748d9d24940cd17938df963456c90fa1a13f3932/torch/ao/quantization/observer.py#L88)) collect statistics on the input values and calculate the qparams <img src="https://latex.codecogs.com/gif.latex?S, Z">. Different calibration schemes result in different quantized outputs, and it's best to empirically verify which scheme works best for your application and architecture (more on that later).
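
A rough sketch of that flow (the tensor shapes and data here are arbitrary): feed an observer a few batches, then ask it for the qparams.

```python
import torch
from torch.quantization.observer import MinMaxObserver

obs = MinMaxObserver()        # records running min/max, the default calibration scheme
for _ in range(3):
    obs(torch.randn(16, 32))  # observers pass tensors through while updating statistics

print(obs.min_val, obs.max_val)
print(obs.calculate_qparams())  # -> (scale, zero_point)
```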

@@ -145,8 +154,6 @@ print(obs.calculate_qparams())
 # (tensor([0.0090, 0.0075, 0.0055]), tensor([125, 187, 82], dtype=torch.int32))
 ```
 
-
-
 ### Backend Engine
 Currently, quantized operators run on x86 machines via the [FBGEMM backend](https://github.com/pytorch/FBGEMM), or use [QNNPACK](https://github.com/pytorch/QNNPACK) primitives on ARM machines. Backend support for server GPUs (via TensorRT and cuDNN) is coming soon. Learn more about extending quantization to custom backends: [RFC-0019](https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md).
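
As an aside, the active backend can be inspected and switched at runtime; a small sketch, assuming an x86 machine:

```python
import torch

print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm', 'qnnpack']
torch.backends.quantized.engine = 'fbgemm'          # pick the x86 backend explicitly
```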

@@ -173,9 +180,9 @@ my_qconfig = torch.quantization.QConfig(
 ```
 
 
-## Techniques in PyTorch
+## In PyTorch
 
-PyTorch allows you a few different ways to quantize your model on the CPU, depending on
+PyTorch allows you a few different ways to quantize your model depending on
 - if you prefer a flexible but manual, or a restricted automagic process (*Eager Mode* v/s *FX Graph Mode*)
 - if qparams for quantizing activations (layer outputs) are precomputed for all inputs, or calculated afresh with each input (*static* v/s *dynamic*),
 - if qparams are computed with or without retraining (*quantization-aware training* v/s *post-training quantization*)
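
One of the simplest combinations of those choices is eager-mode, post-training dynamic quantization; a minimal sketch with an invented toy model:

```python
import torch
import torch.nn as nn

# toy float model; any nn.Linear-heavy model is handled the same way
model_fp32 = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# weights are quantized ahead of time, activation qparams are computed per input
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

print(model_int8(torch.randn(1, 64)).shape)  # torch.Size([1, 10])
```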
