`ipex.llm` provides dedicated optimizations for running Large Language Models (LLMs) faster, including techniques such as paged attention and RoPE fusion. To further help build optimized model implementations, ipex provides the following module/function level APIs:
```python
import intel_extension_for_pytorch as ipex

# using module init and forward
ipex.llm.modules.linearMul
ipex.llm.modules.linearGelu
ipex.llm.modules.linearNewGelu
ipex.llm.modules.linearAdd
ipex.llm.modules.linearAddAdd
ipex.llm.modules.linearSilu
ipex.llm.modules.linearSiluMul
ipex.llm.modules.linear2SiluMul
ipex.llm.modules.linearRelu

# using module init and forward
ipex.llm.modules.RotaryEmbedding
ipex.llm.modules.RMSNorm
ipex.llm.modules.FastLayerNorm
ipex.llm.modules.VarlenAttention
ipex.llm.modules.PagedAttention
ipex.llm.modules.IndirectAccessKVCacheAttention

# using as functions
ipex.llm.functional.rotary_embedding
ipex.llm.functional.rms_norm
ipex.llm.functional.fast_layer_norm
ipex.llm.functional.indirect_access_kv_cache_attention
ipex.llm.functional.varlen_attention

ipex.llm.generation.hf_beam_search
ipex.llm.generation.hf_greedy_search
ipex.llm.generation.hf_sample
```
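As a quick illustration of how the module-level API is meant to be used inside a modeling file, the sketch below drops `ipex.llm.modules.RMSNorm` into a toy MLP block in place of a hand-written RMSNorm. The constructor arguments (`hidden_size`, `eps`) and the plain `forward(hidden_states)` call are assumptions made for this sketch; please consult the ipex.llm API documentation for the exact signatures. The same init-and-forward pattern applies to the other modules listed above.

```python
import torch
import intel_extension_for_pytorch as ipex


class ToyBlock(torch.nn.Module):
    """Minimal block showing where an ipex.llm module would slot into a modeling file."""

    def __init__(self, hidden_size: int = 4096, eps: float = 1e-6):
        super().__init__()
        # Assumed constructor signature: RMSNorm(hidden_size, eps=...); check the API docs.
        self.input_norm = ipex.llm.modules.RMSNorm(hidden_size, eps=eps)
        self.up_proj = torch.nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.down_proj = torch.nn.Linear(4 * hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The fused norm is called like any regular nn.Module in the forward pass.
        normed = self.input_norm(hidden_states)
        return hidden_states + self.down_proj(torch.nn.functional.silu(self.up_proj(normed)))


block = ToyBlock().eval()
with torch.no_grad():
    out = block(torch.randn(1, 32, 4096))
```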
We provide LLaMA, GPT-J, and OPT model implementations as showcases that apply the optimized modules and functions from the `ipex.llm` layers.
| MODEL FAMILY | MODEL NAME (Hugging Face hub) |
|---|---|
| LLAMA | "meta-llama/Llama-2-7b-hf", "meta-llama/Llama-2-13b-hf", etc. |
| GPT-J | "EleutherAI/gpt-j-6b", etc. |
| OPT | "facebook/opt-30b", "facebook/opt-1.3b", etc. |
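For reference, each name in the table is a regular Hugging Face hub checkpoint, so the weights and tokenizer can be fetched with the standard `transformers` loaders. The snippet below is only a minimal sketch of pulling one of the listed checkpoints; it does not by itself apply the optimized modeling from this example, and the model name and dtype are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any model name from the table above works the same way.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The dtype here is just an example; pick the precision you intend to run with.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()
```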
`ipex.llm` provides a single script, `run.py`, to facilitate running generation tasks as shown below. Note: please set up the environment according to ../llm/README.md first.
```bash
python run.py --help # for more detailed usages
```
| Key args of run.py | Notes |
|---|---|
| model name | use "-m MODEL_NAME" to choose the model to run |
| generation | default: beam search (beam size = 4); use "--greedy" for greedy search |
| input tokens | default: 32; use "--input-tokens" to pick a fixed prompt size from [32, 64, 128, 256, 512, 1024, 2016, 2017, 2048, 4096, 8192]; if "--input-tokens" is not used, use "--prompt" to supply a custom input string |
| output tokens | default: 32; use "--max-new-tokens" to choose any other size |
| batch size | default: 1; use "--batch-size" to choose any other size |
| generation iterations | use "--num-iter" and "--num-warmup" to control the repeated generation iterations; default: 100 iterations with 10 warmup iterations |
| ipex prepack | apply the ipex weight prepack optimization with "--use-ipex-optimize" |
| profiling | enable PyTorch profiling with "--profile" |
Note: You may need to log in to your Hugging Face account to access the model files. Please refer to HuggingFace login.
# The following "OMP_NUM_THREADS" and "numactl" settings are based on the assumption that
# the target server has 56 physical cores per numa socket, and we benchmark with 1 socket.
# Please adjust the settings per your hardware.
# Running FP32 model
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py -m meta-llama/Llama-2-7b-hf --dtype float32 --use-ipex-optimize
# Running BF16 model
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py -m meta-llama/Llama-2-7b-hf --dtype bfloat16 --use-ipex-optimize
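The launch pattern above composes with the other run.py arguments listed in the table; as one illustration (the specific values are arbitrary), a BF16 greedy-search run with a 1024-token prompt, 128 new tokens, and batch size 2 could look like:

```bash
# Greedy search, 1024-token prompt, 128 new tokens, batch size 2, BF16
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py -m meta-llama/Llama-2-7b-hf \
  --dtype bfloat16 --use-ipex-optimize --greedy \
  --input-tokens 1024 --max-new-tokens 128 --batch-size 2
```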