
Commit 6c0f10e

text edits
1 parent a9ec8f8 commit 6c0f10e

File tree: 1 file changed (+11, -12 lines)


_posts/2022-1-19-quantization-in-practice.md

Lines changed: 11 additions & 12 deletions
@@ -11,13 +11,11 @@ There are a few different ways to quantize your model with PyTorch. In this blog
 <img src="/assets/images/quantization_gif.gif" width="60%">
 </div>

-## What happens when you "quantize" a model?
-
-Two things, generally - the model gets smaller and runs faster. This is because adding and multiplying 8-bit numbers is faster than 32-bit numbers. Loading a smaller model from memory reduces I/O, making models more energy efficient.
+## A quick introduction to quantization

 > If someone asks you what time it is, you don't respond "10:14:34:430705", but you might say "a quarter past 10".

-Quantizing a model means reducing the numerical precision of its weights and/or activations a.k.a information compression. Quantization of deep networks is especially interesting because overparameterized DNNs have more degrees of freedom and this makes them good candidates for information compression.
+Quantization has roots in information compression; in deep networks it refers to reducing the numerical precision of its weights and/or activations. Overparameterized DNNs have more degrees of freedom and this makes them good candidates for information compression. When you quantize a model, two things generally happen - the model gets smaller and runs with better efficiency. Processing 8-bit numbers is faster than 32-bit numbers, and a smaller model has lower memory footprint and power consumption.

 At the heart of it all is a **mapping function**, a linear projection from floating-point to integer space: $Q(r) = round(r/S + Z)$
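
The mapping function in the hunk above can be made concrete with a small, hedged sketch in plain PyTorch. The toy tensor `r`, the int8 target range $[\alpha_q, \beta_q] = [-128, 127]$, and the MinMax-style choice of $[\alpha, \beta]$ are assumptions for illustration only, not code from the post:

```python
import torch

r = torch.randn(4) * 5                # toy fp32 tensor (assumed for illustration)
alpha, beta = r.min(), r.max()        # clipping range from a MinMax-style calibration
S = (beta - alpha) / (127 - (-128))   # scale: input range divided by the int8 range
Z = -round((alpha / S).item()) - 128  # zero-point: Z = -(alpha/S - alpha_q), with alpha_q = -128

q = torch.clamp(torch.round(r / S + Z), -128, 127).to(torch.int8)  # Q(r) = round(r/S + Z)
r_hat = (q.float() - Z) * S           # dequantize back to an approximation of r
```

The rounding and clamping steps are exactly where quantization error comes from; `r_hat` only approximates `r`.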

@@ -28,13 +26,14 @@ where [$\alpha, \beta$] is the clipping range of the input, i.e. the boundaries

 The process of choosing the appropriate input range is known as **calibration**; commonly used methods are MinMax and Entropy.

-The zero-point $Z$ acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space. $Z = -(\frac{\alpha}{S} - \alpha_q)$
+The zero-point $Z$ acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space. $Z = -(\frac{\alpha}{S} - \alpha_q)$


 ### Quantization Schemes
 $S, Z$ can be calculated and used for quantizing an entire tensor ("per-tensor"), or individually for each channel ("per-channel").

-When [$\alpha, \beta$] are centered around 0, it is called **symmetric quantization**. The range is calculated as $-\alpha = \beta = max(|max(r)|,|min(r)|)$. This removes the need of a zero-point offset in the mapping function. Asymmetric or **affine** schemes simply assign the boundaries to the minimum and maximum observed values. Asymmetric schemes have a tighter clipping range (for non-negative ReLU activations, for instance) but require a non-zero offset.
+When [$\alpha, \beta$] are centered around 0, it is called **symmetric quantization**. The range is calculated as
+$-\alpha = \beta = max(|max(r)|,|min(r)|)$. This removes the need of a zero-point offset in the mapping function. Asymmetric or **affine** schemes simply assign the boundaries to the minimum and maximum observed values. Asymmetric schemes have a tighter clipping range (for non-negative ReLU activations, for instance) but require a non-zero offset.


 ### PyTorch Classes
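
As a hedged aside on the schemes described in the hunk above, PyTorch's observer classes compute $S, Z$ under either scheme. The snippet below uses `MinMaxObserver` from `torch.quantization`; the toy input tensor is invented for the example:

```python
import torch
from torch.quantization import MinMaxObserver

x = torch.rand(8) * 2 - 1  # toy activations roughly in [-1, 1] (assumed)

# Affine (asymmetric): boundaries are the observed min and max; zero-point is non-zero in general.
affine_obs = MinMaxObserver(qscheme=torch.per_tensor_affine, dtype=torch.quint8)
affine_obs(x)
print("affine:   ", affine_obs.calculate_qparams())   # -> (scale, zero_point)

# Symmetric: the range is centered around 0, so the zero-point is fixed (0 for qint8).
sym_obs = MinMaxObserver(qscheme=torch.per_tensor_symmetric, dtype=torch.qint8)
sym_obs(x)
print("symmetric:", sym_obs.calculate_qparams())
```

Swapping in `PerChannelMinMaxObserver` gives the "per-channel" variant mentioned above.
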
@@ -45,7 +44,7 @@ The `QConfig` ([code](https://github.com/PyTorch/PyTorch/blob/d6b15bfcbdaff8eb73

 ## In PyTorch

-PyTorch allows you a few different ways to quantize your model, depending on
+PyTorch allows you a few different ways to quantize your model on the CPU, depending on
 - if you prefer a manual, or a more automatic process (*Eager Mode* v/s *FX Graph Mode*)
 - if $S, Z$ for quantizing activations (layer outputs) are precomputed for all inputs, or calculated afresh with each input (*static* v/s *dynamic*),
 - if $S, Z$ are computed during, or after training (*quantization-aware training* v/s *post-training quantization*)
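
To make one combination from this list concrete, here is a hedged sketch of post-training dynamic quantization in Eager Mode; the toy model is made up for the example, while `torch.quantization.quantize_dynamic` is the stock API:

```python
import torch
import torch.nn as nn

# A toy float model, assumed purely for illustration.
model_fp32 = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Post-training dynamic quantization: weights are quantized ahead of time,
# while S, Z for activations are computed afresh for every input at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

out = model_int8(torch.randn(1, 16))  # runs with dynamically quantized Linear layers
```
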
@@ -170,9 +169,10 @@ It's likely that you can still use QAT by "fine-tuning" it on a sample of the tr

 ## Quantizing "real-world" models

+**Download the [notebook](https://gist.github.com/suraj813/735357e56321237950a0348b50f2f3b4) or run it on [Colab](https://colab.research.google.com/gist/suraj813/735357e56321237950a0348b50f2f3b4/fx-and-eager-mode-quantization-example.ipynb) (note that Colab runtimes may differ significantly from local machines).**
+
 Traceable models can be easily quantized with FX Graph Mode, but it's possible the model you're using is not traceable end-to-end. Maybe it has loops or `if` statements on inputs (dynamic control flow), or relies on third-party libraries. The model I use in this example has [dynamic control flow and uses third-party libraries](https://github.com/facebookresearch/demucs/blob/v2/demucs/model.py). As a result, it cannot be symbolically traced directly. In this code walkthrough, I show how you can bypass this limitation by quantizing the child modules individually for FX Graph Mode, and how to patch Quant/DeQuant stubs in Eager Mode.

-Download the [notebook](https://gist.github.com/suraj813/735357e56321237950a0348b50f2f3b4) or run it on [Colab](https://colab.research.google.com/gist/suraj813/735357e56321237950a0348b50f2f3b4/fx-and-eager-mode-quantization-example.ipynb) (note that Colab runtimes may differ significantly from local machines).


 ## What's next - Define-by-Run Quantization
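
The Eager Mode stub-patching idea mentioned in the hunk above can be sketched roughly as follows; the wrapper and the stand-in child module are hypothetical, and the actual notebook handles the Demucs model differently:

```python
import torch
import torch.nn as nn

class StubWrapper(nn.Module):
    """Hypothetical wrapper that patches Quant/DeQuant stubs around a child module."""
    def __init__(self, child: nn.Module):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at the boundary
        self.child = child
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 on the way out

    def forward(self, x):
        return self.dequant(self.child(self.quant(x)))

# Static post-training quantization of just the wrapped child (stand-in module shown):
wrapped = StubWrapper(nn.Sequential(nn.Conv1d(1, 8, 3), nn.ReLU())).eval()
wrapped.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(wrapped, inplace=True)   # insert observers
wrapped(torch.randn(1, 1, 64))                      # calibrate on sample data
torch.quantization.convert(wrapped, inplace=True)   # swap in quantized kernels
```
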
@@ -194,7 +194,6 @@ DBR is an early prototype [code](https://github.com/PyTorch/PyTorch/tree/master/


 ## References
-[Quantization Docs](https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization)
-[Integer quantization for deep learning inference: Principles and empirical evaluation (arxiv)](https://arxiv.org/abs/2004.09602)
-[A Survey of Quantization Methods for Efficient Neural Network Inference (arxiv)](https://arxiv.org/pdf/2103.13630.pdf)
-arxiv
+- [Quantization Docs](https://pytorch.org/docs/stable/quantization.html#prototype-fx-graph-mode-quantization)
+- [Integer quantization for deep learning inference: Principles and empirical evaluation (arxiv)](https://arxiv.org/abs/2004.09602)
+- [A Survey of Quantization Methods for Efficient Neural Network Inference (arxiv)](https://arxiv.org/pdf/2103.13630.pdf)
