Andrew #1748

Merged
merged 2 commits into from
Sep 27, 2024

8 changes: 4 additions & 4 deletions _posts/2024-09-26-pytorch-native-architecture-optimization.md
@@ -4,6 +4,7 @@ title: "PyTorch Native Architecture Optimization: torchao"
author: Team PyTorch
---


We’re happy to officially launch torchao, a PyTorch native library that makes models faster and smaller by leveraging low-bit dtypes, quantization and sparsity. [torchao](https://github.com/pytorch/ao) is an accessible toolkit of techniques written (mostly) in easy-to-read PyTorch code spanning both inference and training. This blog will help you pick which techniques matter for your workloads.

We benchmarked our techniques on popular GenAI models like Llama 3 and diffusion models and saw minimal drops in accuracy. Unless otherwise noted, the baselines are bf16 runs on an A100 80GB GPU.
@@ -43,6 +44,7 @@ model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

The quantize\_ API has a few different options depending on whether your model is compute bound or memory bound.

```py
from torchao.quantization import (
    # Memory bound models
    int4_weight_only,
@@ -56,7 +58,7 @@ from torchao.quantization import (
    float8_weight_only,
    float8_dynamic_activation_float8_weight,
)
```
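
As a rough illustration of why weight-only quantization helps memory bound models, the sketch below quantizes one row of weights to int8 with a symmetric per-row scale and then dequantizes it. This is plain Python, not torchao code: `quantize_row_int8`, `dequantize_row` and the example weights are hypothetical, and torchao's actual kernels, dtypes and tensor layouts are not shown here.

```python
def quantize_row_int8(row):
    """Symmetric per-row int8 quantization (illustrative, not torchao's implementation).

    Returns the integer values and the scale needed to dequantize them.
    """
    max_abs = max(abs(w) for w in row)
    # Avoid division by zero for an all-zero row.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical example weights
q, scale = quantize_row_int8(weights)
restored = dequantize_row(q, scale)

# The round-trip error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of the two bytes bf16 needs, plus one scale per row, so less data has to be read from memory for every matrix multiply.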

We also have extensive benchmarks on diffusion models in collaboration with the Hugging Face diffusers team in [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao), where we demonstrated a 53.88% speedup on Flux.1-Dev and a 27.33% speedup on CogVideoX-5b.

@@ -72,7 +74,7 @@ But also can do things like quantize weights to int4 and the kv cache to int8 to

Post-training quantization, especially at less than 4 bits, can suffer from serious accuracy degradation. Using [Quantization Aware Training](https://pytorch.org/blog/quantization-aware-training/) (QAT), we’ve managed to recover up to 96% of the accuracy degradation on hellaswag. We’ve integrated this as an end-to-end recipe in torchtune with a minimal [tutorial](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat).
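
One common way to read a figure like "recover up to 96% of the accuracy degradation" is as the share of the gap between the unquantized baseline and the post-training-quantized model that QAT closes. The sketch below computes that share under this assumed definition. The function name and the accuracy values are hypothetical and are not taken from the benchmarks in this post.

```python
def recovered_fraction(baseline_acc, ptq_acc, qat_acc):
    """Fraction of the PTQ accuracy drop that QAT wins back (assumed definition)."""
    degradation = baseline_acc - ptq_acc
    if degradation <= 0:
        raise ValueError("PTQ accuracy must be below the baseline accuracy")
    return (qat_acc - ptq_acc) / degradation


# Hypothetical numbers: the baseline scores 0.80, PTQ drops to 0.70,
# and QAT brings the quantized model back up to 0.79.
example = recovered_fraction(baseline_acc=0.80, ptq_acc=0.70, qat_acc=0.79)
```

With these made-up values, QAT closes roughly nine tenths of the gap that post-training quantization opened.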

![](/assets/images/Figure_3.png){:style="width:100%"}
![](/assets/images/Figure_3.jpg){:style="width:100%"}

# Training

@@ -115,8 +117,6 @@ We’ve been actively working on making sure torchao works well in some of the m
5. In [torchchat](https://github.com/pytorch/torchchat) for post training quantization
6. In SGLang for [int4 and int8 post training quantization](https://github.com/sgl-project/sglang/pull/1341)

#

## Conclusion

If you’re interested in making your models faster and smaller for training or inference, we hope you’ll find torchao useful and easy to integrate.