diff --git a/_posts/2024-12-20-improve-rag-performance.md b/_posts/2024-12-20-improve-rag-performance.md
new file mode 100644
index 000000000000..2ed3cb1ee5e5
--- /dev/null
+++ b/_posts/2024-12-20-improve-rag-performance.md
@@ -0,0 +1,456 @@
---
layout: blog_detail
title: "Improve RAG performance with torch.compile on AWS Graviton Processors"
author: Sunita Nadampalli (AWS), Ankith Gunapal (Meta), Hamid Shojanazeri (Meta)
---

Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to support tasks like answering questions, translating languages, and completing sentences. LLMs also come with challenges such as domain knowledge gaps, factuality issues, and hallucination, which hurt their reliability, especially in fields that require high levels of accuracy such as healthcare, law, or engineering. Retrieval Augmented Generation (RAG) mitigates some of these issues by augmenting an LLM with a specific domain or an organization's internal knowledge base, without the need to retrain the model.

The RAG knowledge source is generally a business-specific database, which is typically deployed on general-purpose CPU infrastructure. Deploying RAG on the same general-purpose CPU infrastructure alongside related business services is therefore both efficient and cost-effective. With this motivation, we evaluated RAG deployment on [AWS Graviton](https://aws.amazon.com/ec2/graviton/) based Amazon EC2 instances, which have been delivering up to a [40% price-performance advantage](https://aws.amazon.com/ec2/graviton/getting-started/) over comparable instances for the majority of workloads, including databases, in-memory caches, big data analytics, media codecs, gaming servers, and machine learning inference.

In the past we published blog posts on how PyTorch was optimized for AWS Graviton processors to accelerate ML inference performance in both eager mode ([blog](https://pytorch.org/blog/optimized-pytorch-w-graviton/)) and `torch.compile` mode ([blog](https://pytorch.org/blog/accelerated-pytorch-inference/)). In this blog we cover how to deploy a typical RAG workload using PyTorch and `torch.compile`, how we improved its performance by up to **1.7x** for the embedding model and **1.3x** for the RAG query on an AWS Graviton3-based m7g.xlarge instance compared to the default PyTorch “eager mode”, and finally a few recommendations that you can apply to your own RAG use cases.


## How to Optimize RAG?

Without RAG, the LLM takes the user input and creates a response based on the information it was trained on (what it already knows). With RAG, an information retrieval component is introduced that uses the user input to first pull information from a new data source. Both the user query and the relevant retrieved information are then given to the LLM, which combines the new knowledge with its training data to create better responses. The following diagram shows the conceptual flow of using RAG with LLMs.


{:style="width:100%"}


**Image 1**: Conceptual flow of using RAG with LLMs

Source: [https://aws.amazon.com/what-is/retrieval-augmented-generation/](https://aws.amazon.com/what-is/retrieval-augmented-generation/)


### Embedding model

At the core of RAG is an embedding model that takes text data and converts it into a vector representation. These vectors are then stored in a vector database. When a user makes a query, the query is first converted to a vector and RAG performs a similarity search on the vector database.
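To make this flow concrete, here is a minimal sketch of the retrieval step. It assumes the `sentence-transformers` library and uses a small in-memory document list in place of a real vector database; the documents and the query are illustrative placeholders only.

```
# Minimal retrieval sketch: embed documents, embed the query, run a similarity search.
# A real RAG deployment would store the document vectors in a vector database.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

documents = [
    "AWS Graviton3 is an Arm-based processor designed by AWS.",
    "torch.compile pre-compiles a PyTorch model into an optimized graph.",
    "RAG augments an LLM with an external knowledge base at query time.",
]
doc_vectors = embedder.encode(documents, convert_to_tensor=True)  # one vector per document

query = "How does RAG improve LLM answers?"
query_vector = embedder.encode(query, convert_to_tensor=True)

# Cosine-similarity search over the stored vectors; pick the best match.
scores = util.cos_sim(query_vector, doc_vectors)
best_match = documents[scores.argmax().item()]

# The retrieved context plus the original query would then be passed to the LLM.
print(f"Retrieved context: {best_match}")
```

In production, the in-memory list and brute-force cosine search would be replaced by a vector database, but the embedding model is invoked in exactly the same way for both indexing and querying, which is why its inference performance matters.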
Hence, the first step in optimizing RAG performance is optimizing the embedding model’s inference performance. We used an AWS Graviton3-based m7g.xlarge instance and the Hugging Face sentence-transformers embedding model for the optimization work. Here is a sample script for profiling the embedding model inference with PyTorch eager mode.


```
import torch
from torch.profiler import profile, ProfilerActivity, record_function
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-mpnet-base-v2"
input_text = ["This is an example sentence", "Each sentence is converted"]

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoded_input = tokenizer(
    input_text, padding=True, truncation=True, return_tensors="pt"
)

warmup, actual = 100, 100
model.eval()

with torch.no_grad():
    # warmup
    for i in range(warmup):
        embeddings = model(**encoded_input)

    # profile the steady-state iterations
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for i in range(actual):
                embeddings = model(**encoded_input)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```



#### Eager mode

PyTorch eager mode is already optimized on AWS Graviton processors when run with the following runtime environment settings, so we included these settings in the baseline and measured the performance shown below. Please refer to [Optimized PyTorch 2.0 Inference with AWS Graviton processors](https://pytorch.org/blog/optimized-pytorch-w-graviton/) for more details on how we optimized PyTorch eager mode on AWS Graviton processors.


```
# Enable the fast math GEMM kernels, to accelerate fp32 inference with bfloat16 gemm
export DNNL_DEFAULT_FPMATH_MODE=BF16

# Enable Linux Transparent Huge Page (THP) allocations,
# to reduce the tensor memory allocation latency
export THP_MEM_ALLOC_ENABLE=1

# Set LRU Cache capacity to cache the primitives and avoid redundant
# memory allocations
export LRU_CACHE_CAPACITY=1024
```



```
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                aten::addmm        61.01%        2.638s        62.49%        2.702s     370.197us          7300
            model_inference        12.01%     519.161ms       100.00%        4.324s        4.324s             1
                  aten::bmm         6.25%     270.084ms        11.96%     517.089ms     215.454us          2400
               aten::select         3.98%     172.165ms         5.34%     230.863ms       1.331us        173500
                aten::copy_         2.11%      91.133ms         2.11%      91.133ms       6.200us         14700
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.324s
```


**Table 1:** Profiler output for Hugging Face sentence-transformers embedding model inference on an AWS Graviton3-based m7g.xlarge instance with PyTorch eager mode

Next, we added `torch.compile`, [weights pre-packing](https://pytorch.org/blog/accelerated-pytorch-inference/#technical-deep-dive-what-are-the-challenges-and-optimization-details), and `torch.inference_mode`, and observed around a 1.7x performance improvement. The following sections describe each of these optimizations and the resulting speedup.


#### torch.compile

In contrast to eager mode, `torch.compile` pre-compiles the entire model into a single graph in a manner that is optimized for running on the given hardware.
Please refer to [Accelerated PyTorch Inference with torch.compile on AWS Graviton processors](https://pytorch.org/blog/accelerated-pytorch-inference/) for more details on `torch.compile` features and how we optimized them on AWS Graviton processors. Invoke `torch.compile` as shown in the following snippet to trigger TorchDynamo compilation for the model. This resulted in around a 1.04x performance improvement over the baseline.


```
model = torch.compile(model)
```

```
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                 aten::addmm        64.46%        2.675s        66.66%        2.766s     378.905us          7300
       Torch-Compiled Region        19.76%     820.085ms        99.04%        4.109s      41.094ms           100
                   aten::bmm         6.66%     276.216ms        12.52%     519.527ms     216.470us          2400
                aten::select         3.98%     164.991ms         5.41%     224.488ms       1.299us        172800
            aten::as_strided         1.66%      69.039ms         1.66%      69.039ms       0.383us        180100
----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.149s
```


**Table 2:** Profiler output for Hugging Face sentence-transformers embedding model inference on an AWS Graviton3-based m7g.xlarge instance with torch.compile mode


#### Weights pre-packing

`torch.compile` opens up opportunities such as pre-packing the model weights, during model compilation, into a layout that is better suited to the given hardware, which improves performance. Set the following config to trigger weights pre-packing. This resulted in around a 1.69x improvement over the baseline.


```
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True
```



```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    mkldnn::_linear_pointwise        39.10%     994.821ms        41.50%        1.056s     144.628us          7300
        Torch-Compiled Region        35.12%     893.675ms        98.42%        2.504s      25.043ms           100
                    aten::bmm        10.96%     278.859ms        21.66%     551.073ms     229.614us          2400
                 aten::select         7.34%     186.838ms         9.98%     253.840ms       1.469us        172800
             aten::as_strided         2.63%      67.002ms         2.63%      67.002ms       0.388us        172800
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.544s
```


**Table 3:** Profiler output for Hugging Face sentence-transformers embedding model inference on an AWS Graviton3-based m7g.xlarge instance with torch.compile and weights pre-packing


#### torch.inference_mode

Additionally, use `torch.inference_mode()` to get further savings by turning off version counter bumps and view tracking for tensors; this brought the cumulative improvement to around 1.70x over the baseline. Please refer to the PyTorch [documentation](https://pytorch.org/docs/stable/generated/torch.autograd.grad_mode.inference_mode.html) for more details.


```
# Use
with torch.inference_mode():
    ...
# instead of
with torch.no_grad():
    ...
```



```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    mkldnn::_linear_pointwise        38.92%     987.276ms        41.17%        1.044s     143.056us          7300
        Torch-Compiled Region        34.92%     885.895ms        98.45%        2.498s      24.975ms           100
                    aten::bmm        11.25%     285.292ms        22.22%     563.594ms     234.831us          2400
                 aten::select         7.74%     196.223ms        10.22%     259.251ms       1.500us        172800
             aten::as_strided         2.48%      63.027ms         2.48%      63.027ms       0.365us        172800
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.537s
```


**Table 4:** Profiler output for Hugging Face sentence-transformers embedding model inference on an AWS Graviton3-based m7g.xlarge instance with torch.compile, weights pre-packing, and inference_mode

The following table shows the incremental performance improvements achieved for the standalone embedding model inference.
| Optimization level | Latency per inference (sec) | Improvement over the baseline |
| --- | --- | --- |
| PyTorch eager mode (Baseline) | 0.04324 | NA |
| torch.compile | 0.04149 | 1.04x |
| weights pre-packing | 0.02544 | 1.69x |
| torch.inference_mode | 0.02537 | 1.70x |
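The latency values above are the profiler self CPU time totals divided by the 100 measured iterations. If you prefer a simple wall-clock measurement, the following sketch reports a comparable per-inference latency; it assumes the same model, inputs, and optimization settings as the complete script below.

```
# Wall-clock latency measurement sketch; model, inputs, and optimizations
# mirror the complete profiling script shown below.
import time

import torch
import torch._inductor.config as config
from transformers import AutoModel, AutoTokenizer

config.cpp.weight_prepack = True
config.freezing = True

model_name = "sentence-transformers/all-mpnet-base-v2"
input_text = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = torch.compile(AutoModel.from_pretrained(model_name).eval())
encoded_input = tokenizer(input_text, padding=True, truncation=True, return_tensors="pt")

warmup, actual = 100, 100
with torch.inference_mode():
    # warmup also triggers compilation and weight pre-packing
    for _ in range(warmup):
        model(**encoded_input)

    start = time.perf_counter()
    for _ in range(actual):
        model(**encoded_input)
    elapsed = time.perf_counter() - start

print(f"Average latency per inference: {elapsed / actual:.5f} sec")
```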
Putting it all together, here is the complete profiling script with `torch.compile`, weights pre-packing, and `torch.inference_mode` applied:

```
import torch
from torch.profiler import profile, ProfilerActivity, record_function
from transformers import AutoModel, AutoTokenizer

# Enable weights pre-packing through the inductor config
import torch._inductor.config as config
config.cpp.weight_prepack = True
config.freezing = True

model_name = "sentence-transformers/all-mpnet-base-v2"
input_text = ["This is an example sentence", "Each sentence is converted"]

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoded_input = tokenizer(
    input_text, padding=True, truncation=True, return_tensors="pt"
)

warmup, actual = 100, 100
model.eval()
model = torch.compile(model)

# torch.inference_mode() instead of torch.no_grad()
with torch.inference_mode():
    # warmup
    for i in range(warmup):
        embeddings = model(**encoded_input)

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for i in range(actual):
                embeddings = model(**encoded_input)
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```
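The profiling scripts above exercise the raw model forward pass, which returns token-level embeddings. To obtain the single fixed-size vector per sentence that would be stored in a vector database, a common approach for sentence-transformers models such as all-mpnet-base-v2 is attention-mask-aware mean pooling followed by L2 normalization. The following sketch shows that post-processing step; it is meant to be appended after the profiling loop and reuses the `embeddings` and `encoded_input` variables from the script above.

```
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state: (batch, seq_len, hidden_dim)
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Average the token embeddings, ignoring padding positions
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)

# `embeddings` and `encoded_input` come from the profiling script above
sentence_embeddings = mean_pooling(embeddings, encoded_input["attention_mask"])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)  # unit-length vectors
print(sentence_embeddings.shape)  # torch.Size([2, 768]) for the two example sentences
```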