diff --git a/cpu/2.4.0+cpu/.buildinfo b/cpu/2.4.0+cpu/.buildinfo new file mode 100644 index 000000000..859392613 --- /dev/null +++ b/cpu/2.4.0+cpu/.buildinfo @@ -0,0 +1,4 @@ +# Sphinx build info version 1 +# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. +config: ce5e33ee2857ff353429c5c71f5ead41 +tags: 645f666f9bcd5a90fca523b33c5a78b7 diff --git a/cpu/2.4.0+cpu/_images/1ins_cus.gif b/cpu/2.4.0+cpu/_images/1ins_cus.gif new file mode 100644 index 000000000..0cc624759 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/1ins_cus.gif differ diff --git a/cpu/2.4.0+cpu/_images/1ins_log.gif b/cpu/2.4.0+cpu/_images/1ins_log.gif new file mode 100644 index 000000000..54265be8e Binary files /dev/null and b/cpu/2.4.0+cpu/_images/1ins_log.gif differ diff --git a/cpu/2.4.0+cpu/_images/1ins_phy.gif b/cpu/2.4.0+cpu/_images/1ins_phy.gif new file mode 100644 index 000000000..bae9fc592 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/1ins_phy.gif differ diff --git a/cpu/2.4.0+cpu/_images/1ins_soc.gif b/cpu/2.4.0+cpu/_images/1ins_soc.gif new file mode 100644 index 000000000..298ec10e2 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/1ins_soc.gif differ diff --git a/cpu/2.4.0+cpu/_images/GenAI-bf16.gif b/cpu/2.4.0+cpu/_images/GenAI-bf16.gif new file mode 100644 index 000000000..2f92f3f3c Binary files /dev/null and b/cpu/2.4.0+cpu/_images/GenAI-bf16.gif differ diff --git a/cpu/2.4.0+cpu/_images/GenAI-int8.gif b/cpu/2.4.0+cpu/_images/GenAI-int8.gif new file mode 100644 index 000000000..458659444 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/GenAI-int8.gif differ diff --git a/cpu/2.4.0+cpu/_images/autotp_bf16_llama.gif b/cpu/2.4.0+cpu/_images/autotp_bf16_llama.gif new file mode 100644 index 000000000..67b03e030 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/autotp_bf16_llama.gif differ diff --git a/cpu/2.4.0+cpu/_images/autotp_woq_int8_llama.gif b/cpu/2.4.0+cpu/_images/autotp_woq_int8_llama.gif new file mode 100644 index 000000000..ae69dff57 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/autotp_woq_int8_llama.gif differ diff --git a/cpu/2.4.0+cpu/_images/bf16_llama.gif b/cpu/2.4.0+cpu/_images/bf16_llama.gif new file mode 100644 index 000000000..3b28c1531 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/bf16_llama.gif differ diff --git a/cpu/2.4.0+cpu/_images/block_diagram_xeon_architecture.png b/cpu/2.4.0+cpu/_images/block_diagram_xeon_architecture.png new file mode 100644 index 000000000..bf0cdcf52 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/block_diagram_xeon_architecture.png differ diff --git a/cpu/2.4.0+cpu/_images/figure1_memory_layout.png b/cpu/2.4.0+cpu/_images/figure1_memory_layout.png new file mode 100644 index 000000000..b37006bbc Binary files /dev/null and b/cpu/2.4.0+cpu/_images/figure1_memory_layout.png differ diff --git a/cpu/2.4.0+cpu/_images/figure2_dispatch.png b/cpu/2.4.0+cpu/_images/figure2_dispatch.png new file mode 100644 index 000000000..0f35d2a93 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/figure2_dispatch.png differ diff --git a/cpu/2.4.0+cpu/_images/figure3_strided_layout.png b/cpu/2.4.0+cpu/_images/figure3_strided_layout.png new file mode 100644 index 000000000..fc7491254 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/figure3_strided_layout.png differ diff --git a/cpu/2.4.0+cpu/_images/hypertune.png b/cpu/2.4.0+cpu/_images/hypertune.png new file mode 100644 index 000000000..1119ec1e3 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/hypertune.png differ diff --git 
a/cpu/2.4.0+cpu/_images/int8_pattern.png b/cpu/2.4.0+cpu/_images/int8_pattern.png new file mode 100644 index 000000000..168343eb0 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/int8_pattern.png differ diff --git a/cpu/2.4.0+cpu/_images/intel_extension_for_pytorch_structure.png b/cpu/2.4.0+cpu/_images/intel_extension_for_pytorch_structure.png new file mode 100644 index 000000000..7b3dd6b32 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/intel_extension_for_pytorch_structure.png differ diff --git a/cpu/2.4.0+cpu/_images/kmp_affinity.jpg b/cpu/2.4.0+cpu/_images/kmp_affinity.jpg new file mode 100644 index 000000000..1993c05cf Binary files /dev/null and b/cpu/2.4.0+cpu/_images/kmp_affinity.jpg differ diff --git a/cpu/2.4.0+cpu/_images/llm_iakv_1.png b/cpu/2.4.0+cpu/_images/llm_iakv_1.png new file mode 100644 index 000000000..91489686c Binary files /dev/null and b/cpu/2.4.0+cpu/_images/llm_iakv_1.png differ diff --git a/cpu/2.4.0+cpu/_images/llm_iakv_2.png b/cpu/2.4.0+cpu/_images/llm_iakv_2.png new file mode 100644 index 000000000..3e3d78a17 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/llm_iakv_2.png differ diff --git a/cpu/2.4.0+cpu/_images/m7i_m6i_comp_gptj6b.png b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_gptj6b.png new file mode 100644 index 000000000..afd69dd01 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_gptj6b.png differ diff --git a/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama13b.png b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama13b.png new file mode 100644 index 000000000..877e231b3 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama13b.png differ diff --git a/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama7b.png b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama7b.png new file mode 100644 index 000000000..22d127dad Binary files /dev/null and b/cpu/2.4.0+cpu/_images/m7i_m6i_comp_llama7b.png differ diff --git a/cpu/2.4.0+cpu/_images/nins_cus.gif b/cpu/2.4.0+cpu/_images/nins_cus.gif new file mode 100644 index 000000000..076486dc7 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/nins_cus.gif differ diff --git a/cpu/2.4.0+cpu/_images/nins_lat.gif b/cpu/2.4.0+cpu/_images/nins_lat.gif new file mode 100644 index 000000000..8e4c2e265 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/nins_lat.gif differ diff --git a/cpu/2.4.0+cpu/_images/nins_thr.gif b/cpu/2.4.0+cpu/_images/nins_thr.gif new file mode 100644 index 000000000..4435b6a41 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/nins_thr.gif differ diff --git a/cpu/2.4.0+cpu/_images/smoothquant_int8_llama.gif b/cpu/2.4.0+cpu/_images/smoothquant_int8_llama.gif new file mode 100644 index 000000000..32ed72476 Binary files /dev/null and b/cpu/2.4.0+cpu/_images/smoothquant_int8_llama.gif differ diff --git a/cpu/2.4.0+cpu/_images/split_sgd.png b/cpu/2.4.0+cpu/_images/split_sgd.png new file mode 100644 index 000000000..2b54e3f8e Binary files /dev/null and b/cpu/2.4.0+cpu/_images/split_sgd.png differ diff --git a/cpu/2.4.0+cpu/_images/two_socket_config.png b/cpu/2.4.0+cpu/_images/two_socket_config.png new file mode 100644 index 000000000..f9f562c4c Binary files /dev/null and b/cpu/2.4.0+cpu/_images/two_socket_config.png differ diff --git a/cpu/2.4.0+cpu/_images/woq_int4_gptj.gif b/cpu/2.4.0+cpu/_images/woq_int4_gptj.gif new file mode 100644 index 000000000..d59fa982f Binary files /dev/null and b/cpu/2.4.0+cpu/_images/woq_int4_gptj.gif differ diff --git a/cpu/2.4.0+cpu/_images/woq_int8_llama.gif b/cpu/2.4.0+cpu/_images/woq_int8_llama.gif new file mode 100644 index 000000000..113e2a2af Binary files /dev/null and 
b/cpu/2.4.0+cpu/_images/woq_int8_llama.gif differ diff --git a/cpu/2.4.0+cpu/_sources/design_doc/cpu/isa_dyndisp.md.txt b/cpu/2.4.0+cpu/_sources/design_doc/cpu/isa_dyndisp.md.txt new file mode 100644 index 000000000..9dd9dc150 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/design_doc/cpu/isa_dyndisp.md.txt @@ -0,0 +1,3 @@ +# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc + +The design document has been merged with [the ISA Dynamic Dispatch feature introduction](../../tutorials/features/isa_dynamic_dispatch.md). \ No newline at end of file diff --git a/cpu/2.4.0+cpu/_sources/index.rst.txt b/cpu/2.4.0+cpu/_sources/index.rst.txt new file mode 100644 index 000000000..ca0ec0a98 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/index.rst.txt @@ -0,0 +1,100 @@ +.. meta:: + :description: This website introduces Intel® Extension for PyTorch* + :keywords: Intel optimization, PyTorch, Intel® Extension for PyTorch*, GPU, discrete GPU, Intel discrete GPU + +Intel® Extension for PyTorch* +############################# + +Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. +Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel X\ :sup:`e`\ Matrix Extensions (XMX) AI engines on Intel discrete GPUs. +Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device. + +In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain +LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLM) `_ section. + +The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``. + +.. note:: + + - GPU features are not included in CPU-only packages. + - Optimizations for CPU-only may have a newer code base due to different development schedules. + +Intel® Extension for PyTorch* has been released as an open–source project at `Github `_. You can find the source code and instructions on how to get started at: + +- **CPU**: `CPU main branch `_ | `Quick Start `_ +- **XPU**: `XPU main branch `_ | `Quick Start `_ + +You can find more information about the product at: + +- `Features `_ +- `Performance `_ + +Architecture +------------ + +Intel® Extension for PyTorch* is structured as shown in the following figure: + +.. figure:: ../images/intel_extension_for_pytorch_structure.png + :width: 800 + :align: center + :alt: Architecture of Intel® Extension for PyTorch* + + Architecture of Intel® Extension for PyTorch* + +- **Eager Mode**: In the eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Further performance improvement is achieved by converting eager-mode models into graph mode using extended graph fusion passes. +- **Graph Mode**: In the graph mode, fusions reduce operator/kernel invocation overhead, resulting in improved performance. 
Compared to the eager mode, the graph mode in PyTorch* normally yields better performance from the optimization techniques like operation fusion. Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Both PyTorch ``Torchscript`` and ``TorchDynamo`` graph modes are supported. With ``Torchscript``, we recommend using ``torch.jit.trace()`` as your preferred option, as it generally supports a wider range of workloads compared to ``torch.jit.script()``. With ``TorchDynamo``, ipex backend is available to provide good performances. +- **CPU Optimization**: On CPU, Intel® Extension for PyTorch* automatically dispatches operators to underlying kernels based on detected instruction set architecture (ISA). The extension leverages vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for increased efficiency. +- **GPU Optimization**: On GPU, optimized operators and kernels are implemented and registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel GPU hardware. Intel® Extension for PyTorch* for GPU utilizes the `DPC++ `_ compiler that supports the latest `SYCL* `_ standard and also a number of extensions to the SYCL* standard, which can be found in the `sycl/doc/extensions `_ directory. + + +Support +------- +The team tracks bugs and enhancement requests using `GitHub issues `_. Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported. + +.. toctree:: + :caption: ABOUT + :maxdepth: 3 + :hidden: + + tutorials/introduction + tutorials/features + Large Language Models (LLM) + tutorials/performance + tutorials/releases + tutorials/known_issues + tutorials/blogs_publications + tutorials/license + +.. toctree:: + :maxdepth: 3 + :caption: GET STARTED + :hidden: + + tutorials/installation + tutorials/getting_started + tutorials/examples + tutorials/cheat_sheet + +.. toctree:: + :maxdepth: 3 + :caption: DEVELOPER REFERENCE + :hidden: + + tutorials/api_doc + +.. toctree:: + :maxdepth: 3 + :caption: PERFORMANCE TUNING + :hidden: + + tutorials/performance_tuning/tuning_guide + tutorials/performance_tuning/launch_script + tutorials/performance_tuning/torchserve + +.. toctree:: + :maxdepth: 3 + :caption: CONTRIBUTING GUIDE + :hidden: + + tutorials/contribution + diff --git a/cpu/2.4.0+cpu/_sources/tutorials/api_doc.rst.txt b/cpu/2.4.0+cpu/_sources/tutorials/api_doc.rst.txt new file mode 100644 index 000000000..1a161c0a3 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/api_doc.rst.txt @@ -0,0 +1,124 @@ +API Documentation +################# + +General +******* + +`ipex.optimize` is generally used for generic PyTorch models. + +.. automodule:: intel_extension_for_pytorch +.. autofunction:: optimize + + +`ipex.llm.optimize` is used for Large Language Models (LLM). + +.. automodule:: intel_extension_for_pytorch.llm +.. autofunction:: optimize + +.. currentmodule:: intel_extension_for_pytorch +.. autoclass:: verbose + +LLM Module Level Optimizations (Prototype) +****************************************** + +Module level optimization APIs are provided for optimizing customized LLMs. + +.. automodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearSilu + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearSiluMul + +.. 
currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: Linear2SiluMul + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearRelu + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearNewGelu + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearGelu + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearMul + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearAdd + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: LinearAddAdd + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: RotaryEmbedding + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: RMSNorm + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: FastLayerNorm + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: IndirectAccessKVCacheAttention + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: PagedAttention + +.. currentmodule:: intel_extension_for_pytorch.llm.modules +.. autoclass:: VarlenAttention + +.. automodule:: intel_extension_for_pytorch.llm.functional +.. autofunction:: rotary_embedding + +.. currentmodule:: intel_extension_for_pytorch.llm.functional +.. autofunction:: rms_norm + +.. currentmodule:: intel_extension_for_pytorch.llm.functional +.. autofunction:: fast_layer_norm + +.. currentmodule:: intel_extension_for_pytorch.llm.functional +.. autofunction:: indirect_access_kv_cache_attention + +.. currentmodule:: intel_extension_for_pytorch.llm.functional +.. autofunction:: varlen_attention + +Fast Bert (Prototype) +************************ + +.. currentmodule:: intel_extension_for_pytorch +.. autofunction:: fast_bert + +Graph Optimization +****************** + +.. currentmodule:: intel_extension_for_pytorch +.. autofunction:: enable_onednn_fusion + +Quantization +************ + +.. automodule:: intel_extension_for_pytorch.quantization +.. autofunction:: get_smooth_quant_qconfig_mapping +.. autofunction:: get_weight_only_quant_qconfig_mapping +.. autofunction:: prepare +.. autofunction:: convert + +Prototype API, introduction is avaiable at `feature page <./features/int8_recipe_tuning_api.md>`_. + +.. autofunction:: autotune + +CPU Runtime +*********** + +.. automodule:: intel_extension_for_pytorch.cpu.runtime +.. autofunction:: is_runtime_ext_enabled +.. autoclass:: CPUPool +.. autoclass:: pin +.. autoclass:: MultiStreamModuleHint +.. autoclass:: MultiStreamModule +.. autoclass:: Task +.. autofunction:: get_core_list_of_node_id + +.. .. automodule:: intel_extension_for_pytorch.quantization +.. 
:members: diff --git a/cpu/2.4.0+cpu/_sources/tutorials/blogs_publications.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/blogs_publications.md.txt new file mode 100644 index 000000000..8a60f8ff3 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/blogs_publications.md.txt @@ -0,0 +1,39 @@ +Blogs & Publications +==================== + +* [Accelerate Llama 2 with Intel AI Hardware and Software Optimizations, Jul 2023](https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html) +* [Accelerate PyTorch\* Training and Inference Performance using Intel® AMX, Jul 2023](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-pytorch-training-inference-on-amx.html) +* [Intel® Deep Learning Boost (Intel® DL Boost) - Improve Inference Performance of Hugging Face BERT Base Model in Google Cloud Platform (GCP) Technology Guide, Apr 2023](https://networkbuilders.intel.com/solutionslibrary/intel-deep-learning-boost-intel-dl-boost-improve-inference-performance-of-hugging-face-bert-base-model-in-google-cloud-platform-gcp-technology-guide) +* [Get Started with Intel® Extension for PyTorch\* on GPU | Intel Software, Mar 2023](https://www.youtube.com/watch?v=Id-rE2Q7xZ0&t=1s) +* [Accelerate PyTorch\* INT8 Inference with New “X86” Quantization Backend on X86 CPUs, Mar 2023](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-pytorch-int8-inf-with-new-x86-backend.html) +* [Accelerating PyTorch Transformers with Intel Sapphire Rapids, Part 1, Jan 2023](https://huggingface.co/blog/intel-sapphire-rapids) +* [Intel® Deep Learning Boost - Improve Inference Performance of BERT Base Model from Hugging Face for Network Security Technology Guide, Jan 2023](https://networkbuilders.intel.com/solutionslibrary/intel-deep-learning-boost-improve-inference-performance-of-bert-base-model-from-hugging-face-for-network-security-technology-guide) +* [Scaling inference on CPUs with TorchServe, PyTorch Conference, Dec 2022](https://www.youtube.com/watch?v=066_Jd6cwZg) +* [What is New in Intel Extension for PyTorch, PyTorch Conference, Dec 2022](https://www.youtube.com/watch?v=SE56wFXdvP4&t=1s) +* [Accelerating PyG on Intel CPUs, Dec 2022](https://www.pyg.org/ns-newsarticle-accelerating-pyg-on-intel-cpus) +* [Accelerating PyTorch Deep Learning Models on Intel XPUs, Dec, 2022](https://www.oneapi.io/event-sessions/accelerating-pytorch-deep-learning-models-on-intel-xpus-2-ai-hpc-2022/) +* [Introducing the Intel® Extension for PyTorch\* for GPUs, Dec 2022](https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-intel-extension-for-pytorch-for-gpus.html) +* [PyTorch Stable Diffusion Using Hugging Face and Intel Arc, Nov 2022](https://towardsdatascience.com/pytorch-stable-diffusion-using-hugging-face-and-intel-arc-77010e9eead6) +* [PyTorch 1.13: New Potential for AI Developers to Enhance Model Performance and Accuracy, Nov 2022](https://www.intel.com/content/www/us/en/developer/articles/technical/pytorch-1-13-new-potential-for-ai-developers.html) +* [Easy Quantization in PyTorch Using Fine-Grained FX, Sep 2022](https://medium.com/intel-analytics-software/easy-quantization-in-pytorch-using-fine-grained-fx-80be2c4bc2d6) +* [Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, Aug 2022](https://pytorch.org/blog/empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16/) +* [Accelerating PyTorch Vision Models with Channels Last on CPU, Aug 
2022](https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/) +* [One-Click Enabling of Intel Neural Compressor Features in PyTorch Scripts, Aug 2022](https://medium.com/intel-analytics-software/one-click-enable-intel-neural-compressor-features-in-pytorch-scripts-5d4e31f5a22b) +* [Increase PyTorch Inference Throughput by 4x, Jul 2022](https://www.intel.com/content/www/us/en/developer/articles/technical/increase-pytorch-inference-throughput-by-4x.html) +* [PyTorch Inference Acceleration with Intel® Neural Compressor, Jun 2022](https://medium.com/pytorch/pytorch-inference-acceleration-with-intel-neural-compressor-842ef4210d7d) +* [Accelerating PyTorch with Intel® Extension for PyTorch, May 2022](https://medium.com/pytorch/accelerating-pytorch-with-intel-extension-for-pytorch-3aef51ea3722) +* [Grokking PyTorch Intel CPU performance from first principles (parts 1), Apr 2022](https://pytorch.org/tutorials/intermediate/torchserve_with_ipex.html) +* [Grokking PyTorch Intel CPU performance from first principles (parts 2), Apr 2022](https://pytorch.org/tutorials/intermediate/torchserve_with_ipex_2.html) +* [Grokking PyTorch Intel CPU performance from first principles, Apr 2022](https://medium.com/pytorch/grokking-pytorch-intel-cpu-performance-from-first-principles-7e39694412db) +* [KT Optimizes Performance for Personalized Text-to-Speech, Nov 2021](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/KT-Optimizes-Performance-for-Personalized-Text-to-Speech/post/1337757) +* [Accelerating PyTorch distributed fine-tuning with Intel technologies, Nov 2021](https://huggingface.co/blog/accelerating-pytorch) +* [Scaling up BERT-like model Inference on modern CPU - parts 1, Apr 2021](https://huggingface.co/blog/bert-cpu-scaling-part-1) +* [Scaling up BERT-like model Inference on modern CPU - parts 2, Nov 2021](https://huggingface.co/blog/bert-cpu-scaling-part-2) +* [NAVER: Low-Latency Machine-Learning Inference](https://www.intel.com/content/www/us/en/customer-spotlight/stories/naver-ocr-customer-story.html) +* [Intel® Extensions for PyTorch, Feb 2021](https://pytorch.org/tutorials/recipes/recipes/intel_extension_for_pytorch.html) +* [Optimizing DLRM by using PyTorch with oneCCL Backend, Feb 2021](https://pytorch.medium.com/optimizing-dlrm-by-using-pytorch-with-oneccl-backend-9f85b8ef6929) +* [Accelerate PyTorch with IPEX and oneDNN using Intel BF16 Technology, Feb 2021](https://medium.com/pytorch/accelerate-pytorch-with-ipex-and-onednn-using-intel-bf16-technology-dca5b8e6b58f) + *Note*: APIs mentioned in it are deprecated. 
+* [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability, Jun 2020](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) +* [Intel and Facebook\* collaborate to boost PyTorch\* CPU performance, Apr 2019](https://www.intel.com/content/www/us/en/developer/articles/case-study/intel-and-facebook-collaborate-to-boost-pytorch-cpu-performance.html) +* [Intel and Facebook\* Collaborate to Boost Caffe\*2 Performance on Intel CPU’s, Apr 2017](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-and-facebook-collaborate-to-boost-caffe2-performance-on-intel-cpu-s.html) diff --git a/cpu/2.4.0+cpu/_sources/tutorials/cheat_sheet.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/cheat_sheet.md.txt new file mode 100644 index 000000000..d7b6b1306 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/cheat_sheet.md.txt @@ -0,0 +1,21 @@ +Cheat Sheet +=========== + +Get started with Intel® Extension for PyTorch\* using the following commands: + +|Description | Command | +| -------- | ------- | +| Basic CPU Installation | `python -m pip install intel_extension_for_pytorch` | +| Import Intel® Extension for PyTorch\* | `import intel_extension_for_pytorch as ipex`| +| Capture a Verbose Log (Command Prompt) | `export ONEDNN_VERBOSE=1` | +| Optimization During Training | `model = ...`
`optimizer = ...`
`model.train()`
`model, optimizer = ipex.optimize(model, optimizer=optimizer)`| +| Optimization During Inference | `model = ...`
`model.eval()`
`model = ipex.optimize(model)` | +| Optimization Using the Low-Precision Data Type bfloat16
During Training (Default FP32) | `model = ...`
`optimizer = ...`
`model.train()`

`model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)`

`with torch.no_grad():`
` with torch.cpu.amp.autocast():`
` model(data)` | +| Optimization Using the Low-Precision Data Type bfloat16
During Inference (Default FP32) | `model = ...`
`model.eval()`

`model = ipex.optimize(model, dtype=torch.bfloat16)`

`with torch.cpu.amp.autocast():`
` model(data)` +| [Prototype] Fast BERT Optimization | `from transformers import BertModel`
`model = BertModel.from_pretrained("bert-base-uncased")`
`model.eval()`

`model = ipex.fast_bert(model, dtype=torch.bfloat16)`| +| Run CPU Launch Script (Command Prompt):
Automate Configuration Settings for Performance | `ipexrun [knobs] [args]`| +| [Prototype] Run HyperTune to perform hyperparameter/execution configuration search | `python -m intel_extension_for_pytorch.cpu.hypertune --conf-file [args]`| +| [Prototype] Enable Graph capture | `model = …`
`model.eval()`
`model = ipex.optimize(model, graph_mode=True)`| +| Post-Training INT8 Quantization (Static) | `model = …`
`model.eval()`
`data = …`

`qconfig = ipex.quantization.default_static_qconfig`

`prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data, inplace=False)`

`for d in calibration_data_loader():`
` prepared_model(d)`

`converted_model = ipex.quantization.convert(prepared_model)`| +| Post-Training INT8 Quantization (Dynamic) | `model = …`
`model.eval()`
`data = …`

`qconfig = ipex.quantization.default_dynamic_qconfig`

`prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data)`

`converted_model = ipex.quantization.convert(prepared_model)` | +| [Prototype] Post-Training INT8 Quantization (Tuning Recipe): | `model = …`
`model.eval()`
`data = …`

`qconfig = ipex.quantization.default_static_qconfig`

`prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data, inplace=False)`

`tuned_model = ipex.quantization.autotune(prepared_model, calibration_data_loader, eval_function, sampling_sizes=[100],`
` accuracy_criterion={'relative': .01}, tuning_time=0)`

`convert_model = ipex.quantization.convert(tuned_model)`| diff --git a/cpu/2.4.0+cpu/_sources/tutorials/contribution.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/contribution.md.txt new file mode 100644 index 000000000..94c7bed35 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/contribution.md.txt @@ -0,0 +1,200 @@ +Contribution +============ + +## Contributing to Intel® Extension for PyTorch\* + +Thank you for your interest in contributing to Intel® Extension for PyTorch\*. Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution: + +1. You want to propose a new feature and implement it. + - Post about your intended feature in a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues), and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it. +2. You want to implement a feature or bug-fix for an outstanding issue. + - Search for your issue in the [GitHub issue list](https://github.com/intel/intel-extension-for-pytorch/issues). + - Pick an issue and comment that you'd like to work on the feature or bug-fix. + - If you need more context on a particular issue, ask and we shall provide. + +Once you implement and test your feature or bug-fix, submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch. + +## Developing Intel® Extension for PyTorch\* + +A full set of instructions on installing Intel® Extension for PyTorch\* from source is in the [Installation document](installation.md#install-via-source-compilation). + +To develop on your machine, here are some tips: + +1. Uninstall all existing Intel® Extension for PyTorch\* installs. You may need to run `pip uninstall intel_extension_for_pytorch` multiple times. You'll know `intel_extension_for_pytorch` is fully uninstalled when you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`. (You should only have to `pip uninstall` a few times, but you can always `uninstall` with `timeout` or in a loop if you're feeling lazy.) + + ```bash + yes | pip uninstall intel_extension_for_pytorch + ``` + +2. Clone a copy of Intel® Extension for PyTorch\* from source: + + ```bash + git clone https://github.com/intel/intel-extension-for-pytorch.git + cd intel-extension-for-pytorch + ``` + + If you already have Intel® Extension for PyTorch\* from source, update it: + + ```bash + git pull --rebase + git submodule sync --recursive + git submodule update --init --recursive --jobs 0 + ``` + +3. Install Intel® Extension for PyTorch\* in `develop` mode: + + Replace: + + ```bash + python setup.py install + ``` + + with: + + ```bash + python setup.py develop + ``` + + This mode will symlink the Python files from the current local source tree into the Python install. After than, if you modify a Python file, you do not need to reinstall PyTorch again. This is especially useful if you are only changing Python files. + + For example: + - Install local Intel® Extension for PyTorch\* in `develop` mode + - modify your Python file `intel_extension_for_pytorch/__init__.py` (for example) + - test functionality + +You do not need to repeatedly install after modifying Python files (`.py`). However, you would need to reinstall if you modify a Python interface (`.pyi`, `.pyi.in`) or non-Python files (`.cpp`, `.cc`, `.cu`, `.h`, etc.). 
+ +If you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch\* first by running `pip uninstall intel_extension_for_pytorch` until you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`; next run `python setup.py clean`. After that, you can install in `develop` mode again. + +### Tips and Debugging + +* CMake must be installed before installing Intel® Extension for PyTorch\*. If you're developing on macOS or Linux, we recommend installing CMake with [Homebrew](https://brew.sh/) via `brew install cmake`. +* Our `setup.py` requires Python >= 3.6. +* If you run into errors when running `python setup.py develop`, here are some debugging steps: + 1. Run `printf '#include <stdio.h>\nint main() { printf("Hello World");}'|clang -x c -; ./a.out` to make sure your compiler toolchain works and can compile this simple Hello World program without errors. + 2. Remove your `build` directory. The `setup.py` script compiles binaries into the `build` folder and caches many details along the way. This saves time the next time you build. If you're running into issues, you can always `rm -rf build` from the top-level directory of the repository and start over. + 3. If you have made edits to the Intel® Extension for PyTorch\* repo, commit any change you'd like to keep and clean the repo with the following commands (note that clean _really_ removes all untracked files and changes): + ```bash + git submodule deinit -f . + git clean -xdf + python setup.py clean + git submodule update --init --recursive --jobs 0 # very important to sync the submodules + python setup.py develop # then try running the command again + ``` + 4. The main step within `python setup.py develop` is running `make` from the `build` directory. If you want to experiment with some environment variables, you can pass them into the command: + ```bash + ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop + ``` + +## Unit testing + +### Python Unit Testing + +All PyTorch test suites are located in the `test` folder and start with `test_`. Run individual test suites using the command `python test/cpu/FILENAME.py`, where `FILENAME` represents the file containing the test suite you wish to run. + +For example, to run all the TorchScript JIT tests (located at `test/cpu/test_jit.py`), you would run: + +```bash +python test/cpu/test_jit.py +``` + +You can narrow down what you're testing even further by specifying the name of an individual test with `TESTCLASSNAME.TESTNAME`. Here, `TESTNAME` is the name of the test you want to run, and `TESTCLASSNAME` is the name of the class in which it is defined. + +Let's say you want to run `test_Sequential`, which is defined as part of the `TestJit` class in `test/cpu/test_jit.py`. Your command would be: + +```bash +python test/cpu/test_jit.py TestJit.test_Sequential +``` + +The `expecttest` and `hypothesis` libraries must be installed to run the tests. `mypy` is an optional dependency, and `pytest` may help run tests more selectively. All these packages can be installed with `conda` or `pip`. + +### Better local unit tests with `pytest` + +We don't officially support `pytest`, but it works well with our `unittest` tests and offers a number of useful features for local development. Install it via `pip install pytest`.
+ +If you want to run only tests that contain a specific substring, you can use the `-k` flag: + +```bash +pytest test/cpu/test_nn.py -k Loss -v +``` + +The above is an example of testing a change to all Loss functions: this command runs tests such as `TestNN.test_BCELoss` and `TestNN.test_MSELoss` and can be useful to save keystrokes. + +### Local linting + +You can run the same linting steps that are used in CI locally via `make`: + +```bash +# Lint all files +make lint -j 6 # run lint (using 6 parallel jobs) + +# Lint only the files you have changed +make quicklint -j 6 +``` + +These jobs may require extra dependencies that aren't dependencies of Intel® Extension for PyTorch\* itself, so you can install them via this command, which you should only have to run once: + +```bash +make setup_lint +``` + +To run a specific linting step, use one of these targets or see the Makefile for a complete list of options. + +```bash +# Check for tabs, trailing newlines, etc. +make quick_checks + +make flake8 + +make mypy + +make cmakelint + +make clang-tidy +``` + +To run a lint only on changes, add the `CHANGED_ONLY` option: + +```bash +make CHANGED_ONLY=--changed-only +``` + +### C++ Unit Testing + +Intel® Extension for PyTorch\* offers tests located in the `test/cpp` folder. These tests are written in C++ and use the Google Test testing framework. After compiling Intel® Extension for PyTorch\* from source, the test runner binaries will be written to the `build/bin` folder. The command to run one of these tests is `./build/bin/FILENAME --gtest_filter=TESTSUITE.TESTNAME`, where `TESTNAME` is the name of the test you'd like to run and `TESTSUITE` is the suite that test is defined in. + +For example, if you wanted to run the test `MayContainAlias`, which is part of the test suite `ContainerAliasingTest` in the file `test/cpp/jit/test_alias_analysis.cpp`, the command would be: + +```bash +./build/bin/test_jit --gtest_filter=ContainerAliasingTest.MayContainAlias +``` + +## Writing documentation + +So you want to write some documentation for your code contribution and don't know where to start? + +Intel® Extension for PyTorch\* uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) for formatting docstrings. Length of line inside docstrings block must be limited to 80 characters to fit into Jupyter documentation popups. + +### Building documentation + +To build the documentation: + +1. Build and install Intel® Extension for PyTorch\* (as discussed above) + +2. Install the prerequisites: + + ```bash + cd docs + pip install -r requirements.txt + ``` + +3. Generate the documentation HTML files. The generated files will be in `docs/_build/html`. + + ```bash + make clean + make html + ``` + +#### Tips + +The `.rst` source files live in [docs/tutorials](https://github.com/intel/intel-extension-for-pytorch/tree/main/docs/tutorials). Some of the `.rst` files pull in docstrings from Intel® Extension for PyTorch\* Python code (for example, via the `autofunction` or `autoclass` directives). To shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base `index.rst` file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete. 
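+
+As a quick reference for the Google style described in the Writing documentation section above, a minimal docstring could look like the sketch below. The helper function and its parameters are hypothetical and are shown only to illustrate the format; all lines stay within the 80-character limit:
+
+```python
+def scale_values(values, factor=1.0):
+    """Scale every element of ``values`` by ``factor``.
+
+    Illustrative (hypothetical) helper, shown only to demonstrate the
+    docstring format; it is not part of the extension's API.
+
+    Args:
+        values (list of float): numbers to be scaled.
+        factor (float, optional): multiplicative scale. Default: ``1.0``.
+
+    Returns:
+        list of float: a new list containing ``v * factor`` for each ``v``.
+    """
+    return [v * factor for v in values]
+```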
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/examples.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/examples.md.txt new file mode 100644 index 000000000..7f9cf1631 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/examples.md.txt @@ -0,0 +1,1326 @@ +Examples +======== + +These examples will guide you through using the Intel® Extension for PyTorch\* on Intel CPUs. + +You can also refer to the [Features](./features.rst) section to get the examples and usage instructions related to particular features. + +The source code for these examples, as well as the feature examples, can be found in the GitHub source tree under the [examples](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu) directory. + +- [Python](#python) examples demonstrate usage of Python APIs: + + - [Training](#training) + - [Inference](#inference) + +- [C++](#c) examples demonstrate usage of C++ APIs +- [Intel® AI Reference Models](#intel-ai-reference-models) provide out-of-the-box use cases, demonstrating the performance benefits achievable with Intel Extension for PyTorch\* + +**Prerequisites**: +Before running these examples, please note the following: + +- Examples using the BFloat16 data type require machines with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 and Intel® Advanced Matrix Extensions (Intel® AMX) BF16 instruction sets. + + +## Python + +### Training + +#### Distributed Training + +Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch\* (oneCCL Bindings for Pytorch\*). The extension supports FP32 and BF16 data types. More detailed information and examples are available at the [Github repo](https://github.com/intel/torch-ccl). + +**Note:** You need to install `torchvision` Python package to run the following example. 
+ +[//]: # (marker_train_ddp_complete) +```python +import os +import torch +import torch.distributed as dist +import torchvision +import oneccl_bindings_for_pytorch as torch_ccl # noqa F401 +import intel_extension_for_pytorch as ipex + +LR = 0.001 +DOWNLOAD = True +DATA = "datasets/cifar10/" + +os.environ["MASTER_ADDR"] = "127.0.0.1" +os.environ["MASTER_PORT"] = "29500" +os.environ["RANK"] = os.environ.get("PMI_RANK", 0) +os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", 1) +dist.init_process_group(backend="ccl", init_method="env://") +rank = os.environ["RANK"] + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +dist_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset) +train_loader = torch.utils.data.DataLoader( + dataset=train_dataset, batch_size=128, sampler=dist_sampler +) + +model = torchvision.models.resnet50() +criterion = torch.nn.CrossEntropyLoss() +optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9) +model.train() +model, optimizer = ipex.optimize(model, optimizer=optimizer) + +model = torch.nn.parallel.DistributedDataParallel(model) + +for batch_idx, (data, target) in enumerate(train_loader): + optimizer.zero_grad() + output = model(data) + loss = criterion(output, target) + loss.backward() + optimizer.step() + print("batch_id: {}".format(batch_idx)) + +if rank == 0: + torch.save( + { + "model_state_dict": model.state_dict(), + "optimizer_state_dict": optimizer.state_dict(), + }, + "checkpoint.pth", + ) + +dist.destroy_process_group() +print("Execution finished") +``` +[//]: # (marker_train_ddp_complete) + +### Inference + +The `optimize` function of Intel® Extension for PyTorch\* applies optimizations to the model, bringing additional performance boosts. For both computer vision workloads and NLP workloads, we recommend applying the `optimize` function against the model object. + +#### Float32 + +##### Eager Mode + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. + +[//]: # (marker_inf_rn50_imp_fp32) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model) +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_fp32) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. 
+ +[//]: # (marker_inf_bert_imp_fp32) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model) +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_imp_fp32) + +##### TorchScript Mode + +We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. + +[//]: # (marker_inf_rn50_ts_fp32) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model) +###################################################### # noqa F401 + +with torch.no_grad(): + d = torch.rand(128, 3, 224, 224) + model = torch.jit.trace(model, d) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_ts_fp32) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. + +[//]: # (marker_inf_bert_ts_fp32) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model) +###################################################### # noqa F401 + +with torch.no_grad(): + d = torch.randint(vocab_size, size=[batch_size, seq_length]) + model = torch.jit.trace(model, (d,), check_trace=False, strict=False) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_ts_fp32) + +##### TorchDynamo Mode (Beta, _NEW feature from 2.0.0_) + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. + +[//]: # (marker_inf_rn50_dynamo_fp32) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT) +model.eval() +data = torch.rand(128, 3, 224, 224) + +# Beta Feature +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, weights_prepack=False) +model = torch.compile(model, backend="ipex") +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_dynamo_fp32) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. 
+ +[//]: # (marker_inf_bert_dynamo_fp32) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +# Beta Feature +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, weights_prepack=False) +model = torch.compile(model, backend="ipex") +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_dynamo_fp32) + +**Note:** In TorchDynamo mode, since the native PyTorch operators like `aten::convolution` and `aten::linear` are well supported and optimized in `ipex` backend, we need to disable weights prepacking by setting `weights_prepack=False` in `ipex.optimize()`. + +#### BFloat16 + +The `optimize` function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the `dtype` parameter to `torch.bfloat16`. +We recommend using Auto Mixed Precision (AMP) with BFloat16 data type. + +##### Eager Mode + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. + +[//]: # (marker_inf_rn50_imp_bf16) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_bf16) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. + +[//]: # (marker_inf_bert_imp_bf16) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_imp_bf16) + +##### TorchScript Mode + +We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. 
+ +[//]: # (marker_inf_rn50_ts_bf16) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + model = torch.jit.trace(model, torch.rand(128, 3, 224, 224)) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_ts_bf16) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. + +[//]: # (marker_inf_bert_ts_bf16) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + d = torch.randint(vocab_size, size=[batch_size, seq_length]) + model = torch.jit.trace(model, (d,), check_trace=False, strict=False) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_ts_bf16) + +##### TorchDynamo Mode (Beta, _NEW feature from 2.0.0_) + +###### Resnet50 + +**Note:** You need to install `torchvision` Python package to run the following example. + +[//]: # (marker_inf_rn50_dynamo_bf16) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) + +# Beta Feature +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False) +model = torch.compile(model, backend="ipex") +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_dynamo_bf16) + +###### BERT + +**Note:** You need to install `transformers` Python package to run the following example. 
+ +[//]: # (marker_inf_bert_dynamo_bf16) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +# Beta Feature +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False) +model = torch.compile(model, backend="ipex") +###################################################### # noqa F401 + +# Note: bf16 inference requires amp.autocast() context # noqa F401 +with torch.no_grad(), torch.cpu.amp.autocast(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_dynamo_bf16) + +#### Fast Bert (*Prototype*) + +**Note:** You need to install `transformers` Python package to run the following example. + +[//]: # (marker_feature_fastbert_bf16) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) +torch.manual_seed(43) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.fast_bert(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_feature_fastbert_bf16) + +#### INT8 + +Starting from Intel® Extension for PyTorch\* 1.12.0, quantization feature supports both static and dynamic modes. + +##### Static Quantization + +###### Calibration + +Please follow the steps below to perform calibration for static quantization: + +1. Import `intel_extension_for_pytorch` as `ipex`. +2. Import `prepare` and `convert` from `intel_extension_for_pytorch.quantization`. +3. Instantiate a config object from `torch.ao.quantization.QConfig` to save configuration data during calibration. +4. Prepare model for calibration. +5. Perform calibration against dataset. +6. Invoke `ipex.quantization.convert` function to apply the calibration configure object to the fp32 model object to get an INT8 model. +7. Save the INT8 model into a `pt` file. + +**Note:** You need to install `torchvision` Python package to run the following example. 
+ +[//]: # (marker_int8_static) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex +from intel_extension_for_pytorch.quantization import prepare, convert + +###################################################### # noqa F401 + +##### Example Model ##### # noqa F401 +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(128, 3, 224, 224) +######################### # noqa F401 + +qconfig_mapping = ipex.quantization.default_static_qconfig_mapping +# Alternatively, define your own qconfig_mapping: +# from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig, QConfigMapping +# qconfig = QConfig( +# activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8), +# weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +# qconfig_mapping = QConfigMapping().set_global(qconfig) +prepared_model = prepare(model, qconfig_mapping, example_inputs=data, inplace=False) + +##### Example Dataloader ##### # noqa F401 +import torchvision + +DOWNLOAD = True +DATA = "datasets/cifar10/" + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +calibration_data_loader = torch.utils.data.DataLoader( + dataset=train_dataset, batch_size=128 +) + +with torch.no_grad(): + for batch_idx, (d, target) in enumerate(calibration_data_loader): + print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}") + prepared_model(d) +############################## # noqa F401 + +converted_model = convert(prepared_model) +with torch.no_grad(): + traced_model = torch.jit.trace(converted_model, data) + traced_model = torch.jit.freeze(traced_model) + +traced_model.save("static_quantized_model.pt") + +print("Saved model to: static_quantized_model.pt") +``` +[//]: # (marker_int8_static) + +###### Deployment + +For deployment, the INT8 model is loaded from the local file and can be used directly for sample inference. + +Follow the steps below: + +1. Import `intel_extension_for_pytorch` as `ipex`. +2. Load the INT8 model from the saved file. +3. Run inference. + +[//]: # (marker_int8_deploy) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex # noqa F401 + +###################################################### # noqa F401 + +model = torch.jit.load("static_quantized_model.pt") +model.eval() +model = torch.jit.freeze(model) +data = torch.rand(128, 3, 224, 224) + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_int8_deploy) + +##### Dynamic Quantization + +Please follow the steps below to perform dynamic quantization: + +1. Import `intel_extension_for_pytorch` as `ipex`. +2. Import `prepare` and `convert` from `intel_extension_for_pytorch.quantization`. +3. Instantiate a config object from `torch.ao.quantization.QConfig` to save configuration data during calibration. +4. Prepare model for quantization. +5. Convert the model. +6. Run inference to perform dynamic quantization. +7. Save the INT8 model into a `pt` file. 
+ +**Note:** You need to install `transformers` Python package to run the following example. + +[//]: # (marker_int8_dynamic) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex +from intel_extension_for_pytorch.quantization import prepare, convert + +###################################################### # noqa F401 + +##### Example Model ##### # noqa F401 +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 128 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) +######################### # noqa F401 + +qconfig_mapping = ipex.quantization.default_dynamic_qconfig_mapping +# Alternatively, define your own qconfig: +# from torch.ao.quantization import PerChannelMinMaxObserver, PlaceholderObserver, QConfig, QConfigMapping +# qconfig = QConfig( +# activation = PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True), +# weight = PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +# qconfig_mapping = QConfigMapping().set_global(qconfig) +prepared_model = prepare(model, qconfig_mapping, example_inputs=data) + +converted_model = convert(prepared_model) +with torch.no_grad(): + traced_model = torch.jit.trace( + converted_model, (data,), check_trace=False, strict=False + ) + traced_model = torch.jit.freeze(traced_model) + +traced_model.save("dynamic_quantized_model.pt") + +print("Saved model to: dynamic_quantized_model.pt") +``` +[//]: # (marker_int8_dynamic) + +### Large Language Model (LLM) + +Intel® Extension for PyTorch\* provides dedicated optimization for running Large Language Models (LLM) faster. +A set of data types are supported for various scenarios, including FP32, BF16, Smooth Quantization INT8, Weight Only Quantization INT8/INT4 (prototype). + +**Note:** You need to install `transformers==4.43.2` Python package to run the following example. +In addition, you may need to log in your HuggingFace account to access the pretrained model files. +Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login). 
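+
+For the HuggingFace login mentioned above, one option is to authenticate programmatically before loading the model. This is an illustrative sketch, assuming the `huggingface_hub` package is installed and you already have an access token with permission for the model you want to download; the same can also be done from a shell with `huggingface-cli login`:
+
+```python
+from huggingface_hub import login
+
+# Prompts for (or accepts) a HuggingFace access token. Gated checkpoints such
+# as the meta-llama models additionally require that your account has been
+# granted access to the repository.
+login()
+```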
+ +#### FP32/BF16 + +[//]: # (marker_llm_optimize) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +###################################################### # noqa F401 +import argparse +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, +) + +# args +parser = argparse.ArgumentParser("Generation script (fp32/bf16 path)", add_help=False) +parser.add_argument( + "--dtype", + type=str, + choices=["float32", "bfloat16"], + default="float32", + help="choose the weight dtype and whether to enable auto mixed precision or not", +) +parser.add_argument( + "--max-new-tokens", default=32, type=int, help="output max new tokens" +) +parser.add_argument( + "--prompt", default="What are we having for dinner?", type=str, help="input prompt" +) +parser.add_argument("--greedy", action="store_true") +parser.add_argument("--batch-size", default=1, type=int, help="batch size") +args = parser.parse_args() +print(args) + +# dtype +amp_enabled = True if args.dtype != "float32" else False +amp_dtype = getattr(torch, args.dtype) + +# load model +model_id = "facebook/opt-125m" +config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=amp_dtype, + config=config, + low_cpu_mem_usage=True, + trust_remote_code=True, +) +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = model.eval() +model = model.to(memory_format=torch.channels_last) + +# Intel(R) Extension for PyTorch* +#################### code changes #################### # noqa F401 +model = ipex.llm.optimize( + model, + dtype=amp_dtype, + inplace=True, + deployment_mode=True, +) +###################################################### # noqa F401 + +# generate args +num_beams = 1 if args.greedy else 4 +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams) + +# input prompt +prompt = args.prompt +input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1) +print("---- Prompt size:", input_size) +prompt = [prompt] * args.batch_size + +# inference +with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast( + enabled=amp_enabled +): + input_ids = tokenizer(prompt, return_tensors="pt").input_ids + gen_ids = model.generate( + input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs + ) + gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) + input_tokens_lengths = [x.shape[0] for x in input_ids] + output_tokens_lengths = [x.shape[0] for x in gen_ids] + total_new_tokens = [ + o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths) + ] + print(gen_text, total_new_tokens, flush=True) +``` +[//]: # (marker_llm_optimize) + +#### Smooth Quantization INT8 + +The typical steps shown in the example are: + +1. Calibration process: Run the example script specifying `--calibration`, along with other related arguments. +When the calibration process is completed, the quantization summary files would be generated. + +2. Model inference process: Run the example script without specifying `--calibration`. In this process the quantized model +will be generated via the original model and the quantization config and summary files, and will +generate results for the input prompt. 
+ +[//]: # (marker_llm_optimize_sq) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +###################################################### # noqa F401 +import argparse +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, +) + +# args +parser = argparse.ArgumentParser( + "Generation script (static quantization path)", add_help=False +) +parser.add_argument( + "--dtype", + type=str, + choices=["float32", "bfloat16"], + default="float32", + help="choose the weight dtype and whether to enable auto mixed precision or not", +) +parser.add_argument( + "--max-new-tokens", default=32, type=int, help="output max new tokens" +) +parser.add_argument( + "--prompt", default="What are we having for dinner?", type=str, help="input prompt" +) +parser.add_argument("--greedy", action="store_true") +parser.add_argument("--batch-size", default=1, type=int, help="batch size") +parser.add_argument("--calibration", action="store_true") +parser.add_argument( + "--calibration-samples", + default=512, + type=int, + help="total number of calibration samples", +) +parser.add_argument( + "--int8-qconfig", + nargs="?", + default="./qconfig.json", + help="static quantization factors summary files generated by calibration", +) +parser.add_argument("--dataset", nargs="?", default="NeelNanda/pile-10k") +parser.add_argument( + "--alpha", default=0.5, type=float, help="alpha value for smoothquant" +) +args = parser.parse_args() +print(args) + + +# dtype +amp_enabled = True if args.dtype != "float32" and not calibration else False +amp_dtype = getattr(torch, args.dtype) if not calibration else torch.float32 + +# load model +model_id = "meta-llama/Llama-2-7b-hf" +config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=amp_dtype, + config=config, + low_cpu_mem_usage=True, + trust_remote_code=True, +) +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = model.eval() +model = model.to(memory_format=torch.channels_last) + +num_beams = 1 if args.greedy else 4 +beam_idx_tmp = torch.zeros( + (2048, int(args.batch_size * num_beams)), dtype=torch.long +).contiguous() +global_past_key_value = [ + ( + torch.zeros(1, 0, 0, 1, dtype=torch.long).contiguous(), + torch.zeros( + [ + 1, + model.config.num_attention_heads, + 1, + int(model.config.hidden_size / model.config.num_attention_heads), + ] + ).contiguous(), + torch.zeros( + [ + 1, + user_model.config.num_attention_heads, + 1, + int(model.config.hidden_size / model.config.num_attention_heads), + ] + ).contiguous(), + beam_idx_tmp, + ) + for i in range(model.config.num_hidden_layers) +] + + +# Intel(R) Extension for PyTorch* +#################### code changes #################### # noqa F401 +class Calibration: + def __init__(self, dataset, tokenizer, batch_size=1, pad_val=1, pad_max=512): + self.dataset = dataset + self.tokenizer = tokenizer + self.batch_size = batch_size + self.pad_val = pad_val + self.pad_max = pad_max + + # tokenize the dataset + self.dataset = self.dataset.map(self.tokenize_function, batched=True) + self.dataset.set_format(type="torch", columns=["input_ids"]) + + @torch.no_grad() + def tokenize_function(self, examples): + if "prompt" in examples: + example = self.tokenizer(examples["prompt"]) + elif "text" in examples: + example = self.tokenizer(examples["text"]) + elif "code" in examples: + example = 
self.tokenizer(examples["code"]) + return example + + @torch.no_grad() + def collate_batch(self, batch): + position_ids_padded = [] + input_ids_padded = [] + last_ind = [] + attention_mask_padded = [] + for text in batch: + input_ids = text["input_ids"] + input_ids = ( + input_ids[: int(self.pad_max)] + if len(input_ids) > int(self.pad_max) + else input_ids + ) + last_ind.append(input_ids.shape[0] - 1) + attention_mask = torch.ones(len(input_ids)) + position_ids = torch.arange(len(input_ids)) + input_ids_padded.append(input_ids) + attention_mask_padded.append(attention_mask) + position_ids_padded.append(position_ids) + return ( + ( + torch.vstack(input_ids_padded), + torch.vstack(attention_mask_padded), + torch.vstack(position_ids_padded), + tuple(global_past_key_value), + ), + torch.tensor(last_ind), + ) + + +calib_dataset = load_dataset(args.dataset, split="train") +calib_evaluator = Calibration(calib_dataset, tokenizer, args.batch_size) +calib_dataloader = DataLoader( + calib_evaluator.dataset, + batch_size=1, + shuffle=False, + collate_fn=calib_evaluator.collate_batch, +) + +qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping(alpha=args.alpha) + +if args.calibration: + example_inputs = None + for i, ( + (input_ids, attention_mask, position_ids, past_key_values), + last_ind, + ) in enumerate(calib_dataloader): + example_inputs = (input_ids, attention_mask, position_ids, past_key_values) + break + from intel_extension_for_pytorch.quantization import prepare + + model = ipex.llm.optimize( + model.eval(), + dtype=amp_dtype, + quantization_config=qconfig, + inplace=True, + deployment_mode=False, + ) + prepared_model = prepare(model.eval(), qconfig, example_inputs=example_inputs) + with torch.no_grad(): + for i, ( + (input_ids, attention_mask, position_ids, past_key_values), + last_ind, + ) in enumerate(calib_dataloader): + if i == args.calibration_samples: + break + prepared_model( + input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + ) + + prepared_model.save_qconf_summary(qconf_summary=args.int8_qconfig) + print( + "calibration Done! 
Will exit and please launch model quantization and benchmark" + ) + exit(0) +else: + model = ipex.llm.optimize( + model.eval(), + dtype=amp_dtype, + quantization_config=qconfig, + qconfig_summary_file=args.int8_qconfig, + inplace=True, + deployment_mode=True, + ) + print("model quantization - Done!") + +###################################################### # noqa F401 + +# generate args +num_beams = 1 if args.greedy else 4 +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams) + +# input prompt +prompt = args.prompt +input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1) +print("---- Prompt size:", input_size) +prompt = [prompt] * args.batch_size + +# inference +with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast( + enabled=amp_enabled +): + input_ids = tokenizer(prompt, return_tensors="pt").input_ids + gen_ids = model.generate( + input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs + ) + gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) + input_tokens_lengths = [x.shape[0] for x in input_ids] + output_tokens_lengths = [x.shape[0] for x in gen_ids] + total_new_tokens = [ + o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths) + ] + print(gen_text, total_new_tokens, flush=True) +``` +[//]: # (marker_llm_optimize_sq) + +#### Weight Only Quantization INT8/INT4 + +[//]: # (marker_llm_optimize_woq) +```python +import torch + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +###################################################### # noqa F401 +import argparse +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, +) + +# args +parser = argparse.ArgumentParser( + "Generation script (weight only quantization path)", add_help=False +) +parser.add_argument( + "--dtype", + type=str, + choices=["float32", "bfloat16"], + default="float32", + help="choose the weight dtype and whether to enable auto mixed precision or not", +) +parser.add_argument( + "--max-new-tokens", default=32, type=int, help="output max new tokens" +) +parser.add_argument( + "--prompt", default="What are we having for dinner?", type=str, help="input prompt" +) +parser.add_argument("--greedy", action="store_true") +parser.add_argument("--batch-size", default=1, type=int, help="batch size") +# Intel(R) Extension for PyTorch* +#################### code changes #################### # noqa F401 +parser.add_argument( + "--lowp-mode", + choices=["AUTO", "BF16", "FP32", "INT8", "FP16"], + default="AUTO", + type=str, + help="low precision mode for weight only quantization. " + "It indicates data type for computation for speedup at the cost " + "of accuracy. Unrelated to activation or weight data type." + "It is not supported yet to use lowp_mode=INT8 for INT8 weight, " + "falling back to lowp_mode=BF16 implicitly in this case." + "If set to AUTO, lowp_mode is determined by weight data type: " + "lowp_mode=BF16 is used for INT8 weight " + "and lowp_mode=INT8 used for INT4 weight", +) +parser.add_argument( + "--weight-dtype", + choices=["INT8", "INT4"], + default="INT8", + type=str, + help="weight data type for weight only quantization. Unrelated to activation" + " data type or lowp-mode. 
If `--low-precision-checkpoint` is given, weight" + " data type is always INT4 and this argument is not needed.", +) +parser.add_argument( + "--low-precision-checkpoint", + default="", + type=str, + help="Low precision checkpoint file generated by calibration, such as GPTQ. It contains" + " modified weights, scales, zero points, etc. For better accuracy of weight only" + " quantization with INT4 weight.", +) +###################################################### # noqa F401 +args = parser.parse_args() +print(args) + +# dtype +amp_enabled = True if args.dtype != "float32" else False +amp_dtype = getattr(torch, args.dtype) + +# load model +model_id = "facebook/opt-125m" +config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True) +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=amp_dtype, + config=config, + low_cpu_mem_usage=True, + trust_remote_code=True, +) +tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) +model = model.eval() +model = model.to(memory_format=torch.channels_last) + +# Intel(R) Extension for PyTorch* +#################### code changes #################### # noqa F401 +from intel_extension_for_pytorch.quantization import WoqWeightDtype + +weight_dtype = ( + WoqWeightDtype.INT4 if args.weight_dtype == "INT4" else WoqWeightDtype.INT8 +) + +if args.lowp_mode == "INT8": + lowp_mode = ipex.quantization.WoqLowpMode.INT8 +elif args.lowp_mode == "FP32": + lowp_mode = ipex.quantization.WoqLowpMode.NONE +elif args.lowp_mode == "FP16": + lowp_mode = ipex.quantization.WoqLowpMode.FP16 +elif args.lowp_mode == "BF16": + lowp_mode = ipex.quantization.WoqLowpMode.BF16 +else: # AUTO + if args.low_precision_checkpoint != "" or weight_dtype == WoqWeightDtype.INT4: + lowp_mode = ipex.quantization.WoqLowpMode.INT8 + else: + lowp_mode = ipex.quantization.WoqLowpMode.BF16 + +qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping( + weight_dtype=weight_dtype, lowp_mode=lowp_mode +) +if args.low_precision_checkpoint != "": + low_precision_checkpoint = torch.load(args.low_precision_checkpoint) +else: + low_precision_checkpoint = None +model = ipex.llm.optimize( + model.eval(), + dtype=amp_dtype, + quantization_config=qconfig, + low_precision_checkpoint=low_precision_checkpoint, + deployment_mode=True, + inplace=True, +) + +###################################################### # noqa F401 + +# generate args +num_beams = 1 if args.greedy else 4 +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams) + +# input prompt +prompt = args.prompt +input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1) +print("---- Prompt size:", input_size) +prompt = [prompt] * args.batch_size + +# inference +with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast( + enabled=amp_enabled +): + input_ids = tokenizer(prompt, return_tensors="pt").input_ids + gen_ids = model.generate( + input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs + ) + gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) + input_tokens_lengths = [x.shape[0] for x in input_ids] + output_tokens_lengths = [x.shape[0] for x in gen_ids] + total_new_tokens = [ + o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths) + ] + print(gen_text, total_new_tokens, flush=True) +``` +[//]: # (marker_llm_optimize_woq) + +**Note:** Please check [LLM Best Known Practice Page](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm) +for detailed environment setup and 
+LLM workload running instructions.
+
+## C++
+
+To work with libtorch, the C++ library of PyTorch, Intel® Extension for PyTorch\* provides its own C++ dynamic library as well. The C++ library is intended to handle inference workloads only, such as service deployment. For regular development, use the Python interface. Apart from linking the extension's dynamic library, no specific code changes are required compared with regular libtorch usage. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in the [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application).
+
+During compilation, Intel optimizations are activated automatically once the C++ dynamic library of Intel® Extension for PyTorch\* is linked.
+
+The example code below works for all data types.
+
+**example-app.cpp**
+
+[//]: # (marker_cppsdk_sample)
+```cpp
+#include <torch/script.h>
+#include <iostream>
+#include <memory>
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  } catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+
+  std::vector<torch::jit::IValue> inputs;
+  torch::Tensor input = torch::rand({1, 3, 224, 224});
+  inputs.push_back(input);
+
+  at::Tensor output = module.forward(inputs).toTensor();
+  std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+```
+[//]: # (marker_cppsdk_sample)
+
+**CMakeLists.txt**
+
+[//]: # (marker_cppsdk_cmake)
+```cmake
+cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-app)
+
+find_package(IPEX REQUIRED)
+
+add_executable(example-app example-app.cpp)
+target_link_libraries(example-app "${TORCH_IPEX_LIBRARIES}")
+
+set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
+```
+[//]: # (marker_cppsdk_cmake)
+
+**Command for compilation**
+
+```bash
+$ cd examples/cpu/inference/cpp
+$ mkdir build
+$ cd build
+$ cmake -DCMAKE_PREFIX_PATH=<LIBTORCH_PATH> ..
+$ make
+```
+
+If *Found IPEX* is shown with a dynamic library path, the extension has been linked into the binary. This can be verified with the Linux command *ldd*.
+
+```bash
+$ cmake -DCMAKE_PREFIX_PATH=/workspace/libtorch ..
+-- The C compiler identification is GNU XX.X.X
+-- The CXX compiler identification is GNU XX.X.X
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /usr/bin/cc - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /usr/bin/c++ - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+CMake Warning at /workspace/libtorch/share/cmake/Torch/TorchConfig.cmake:22 (message):
+  static library kineto_LIBRARY-NOTFOUND not found.
+Call Stack (most recent call first):
+  /workspace/libtorch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
+  /workspace/libtorch/share/cmake/IPEX/IPEXConfig.cmake:84 (FIND_PACKAGE)
+  CMakeLists.txt:4 (find_package)
+
+
+-- Found Torch: /workspace/libtorch/lib/libtorch.so
+-- Found IPEX: /workspace/libtorch/lib/libintel-ext-pt-cpu.so
+-- Configuring done
+-- Generating done
+-- Build files have been written to: examples/cpu/inference/cpp/build
+
+$ ldd example-app
+    ...
+ libtorch.so => /workspace/libtorch/lib/libtorch.so (0x00007f3cf98e0000) + libc10.so => /workspace/libtorch/lib/libc10.so (0x00007f3cf985a000) + libintel-ext-pt-cpu.so => /workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007f3cf70fc000) + libtorch_cpu.so => /workspace/libtorch/lib/libtorch_cpu.so (0x00007f3ce16ac000) + ... + libdnnl_graph.so.0 => /workspace/libtorch/lib/libdnnl_graph.so.0 (0x00007f3cde954000) + ... +``` \ No newline at end of file diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features.rst.txt b/cpu/2.4.0+cpu/_sources/tutorials/features.rst.txt new file mode 100644 index 000000000..d7d98b81c --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features.rst.txt @@ -0,0 +1,236 @@ +Features +======== + +This section provides a detailed overview of supported features. + +Easy-to-use Python API +---------------------- + +With only two or three clauses added to your original code, Intel® Extension for PyTorch\* provides simple frontend Python APIs and utilities to get performance optimizations such as graph optimization and operator optimization. + +Check the `API Documentation`_ for API functions description and `Examples `_ for usage guidance. + +.. note:: + + The package name used when you import Intel® Extension for PyTorch\* changed + from ``intel_pytorch_extension`` (for versions 1.2.0 through 1.9.0) to + ``intel_extension_for_pytorch`` (for versions 1.10.0 and later). Use the + correct package name depending on the version you are using. + + +Large Language Models (LLM, *NEW feature from 2.1.0*) +----------------------------------------------------- + +In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain LLM models are +introduced in the Intel® Extension for PyTorch*. + +For more detailed information, check `LLM Optimizations Overview <./llm.html>`_. + +torch.compile (Beta, *NEW feature from 2.0.0*) +------------------------------------------------------ + +PyTorch* 2.0 introduces a new feature ``torch.compile`` to speed up PyTorch* code. It makes PyTorch code run faster by JIT-compiling of PyTorch code into optimized kernels. Intel® Extension for PyTorch\* enables a backend, ``ipex``, in the ``torch.compile`` to optimize generation of the graph model. + +To use the feature, import the Intel® Extension for PyTorch* and set the backend parameter of the ``torch.compile`` to ``ipex``. + +With ``torch.compile`` backend set to ``ipex``, the following will happen: + +1. Register Intel® Extension for PyTorch\* operators to Inductor. +2. Custom fusions at FX graph level, e.g., the migration of existing TorchScript-based fusion kernels in IPEX to inductor, pattern-based fusions to achieve peak performance. + +While optimizations with ``torch.compile`` apply to backend, invocation of the ``ipex.optimize`` function is highly recommended as well to apply optimizations in frontend. + +.. code-block:: python + + import torch + import intel_extension_for_pytorch as ipex + ... + model = ipex.optimize(model, weights_prepack=False) + model = torch.compile(model, backend='ipex') + ... + +ISA Dynamic Dispatching +----------------------- + +Intel® Extension for PyTorch\* features dynamic dispatching functionality to automatically adapt execution binaries to the most advanced instruction set available on your machine. 
+ +For details, refer to `ISA Dynamic Dispatching `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/isa_dynamic_dispatch + +Auto Channels Last +------------------ + +Comparing to the default NCHW memory format, using channels_last (NHWC) memory format could further accelerate convolutional neural networks. In Intel® Extension for PyTorch*, NHWC memory format has been enabled for most key CPU operators. More detailed information is available at `Channels Last `_. + +Intel® Extension for PyTorch* automatically converts a model to channels last memory format when users optimize the model with ``ipex.optimize(model)``. With this feature, there is no need to manually apply ``model=model.to(memory_format=torch.channels_last)`` anymore. More detailed information is available at `Auto Channels Last `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/nhwc + features/auto_channels_last + +Auto Mixed Precision (AMP) +-------------------------- + +Low precision data type BFloat16 has been natively supported on 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with AVX512 instruction set. It will also be supported on the next generation of Intel® Xeon® Scalable Processors with Intel® Advanced Matrix Extensions (Intel® AMX) instruction set providing further boosted performance. The support of Auto Mixed Precision (AMP) with BFloat16 for CPU and BFloat16 optimization of operators has been enabled in Intel® Extension for PyTorch\*, and partially upstreamed to PyTorch master branch. These optimizations will be landed in PyTorch master through PRs that are being submitted and reviewed. + +Prefer to use `torch.cpu.amp.autocast()` instead of `torch.autocast(device_name="cpu")`. + +For details, refer to `Auto Mixed Precision (AMP) `_. + +Bfloat16 computation can be conducted on platforms with AVX512 instruction set. On platforms with `AVX512 BFloat16 instruction `_, there will be an additional performance boost. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/amp + +Graph Optimization +------------------ + +To further optimize TorchScript performance, Intel® Extension for PyTorch\* supports transparent fusion of frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU. +For more detailed information, check `Graph Optimization `_. + +Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization methodologies such as operator fusion. Intel® Extension for PyTorch* provides further optimizations in graph mode. We recommend taking advantage of Intel® Extension for PyTorch* with `TorchScript `_. You may wish to run with the ``torch.jit.trace()`` function first, since it generally works better with Intel® Extension for PyTorch* than using the ``torch.jit.script()`` function. More detailed information can be found at the `pytorch.org website `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/graph_optimization + +Operator Optimization +--------------------- + +Intel® Extension for PyTorch* also optimizes operators and implements several customized operators for performance boosts. A few ATen operators are replaced by their optimized counterparts in Intel® Extension for PyTorch* via the ATen registration mechanism. Some customized operators are implemented for several popular topologies. For instance, ROIAlign and NMS are defined in Mask R-CNN. To improve performance of these topologies, Intel® Extension for PyTorch* also optimized these customized operators. + +.. currentmodule:: intel_extension_for_pytorch.nn +.. 
autoclass:: FrozenBatchNorm2d + +.. currentmodule:: intel_extension_for_pytorch.nn.functional +.. autofunction:: interaction + +.. currentmodule:: intel_extension_for_pytorch.nn.modules +.. autoclass:: MergedEmbeddingBag +.. autoclass:: MergedEmbeddingBagWithSGD + +**Auto kernel selection** is a feature that enables users to tune for better performance with GEMM operations. We aim to provide good default performance by leveraging the best of math libraries and enabling `weights_prepack`. The feature was tested with broad set of models. If you want to try other options, you can use `auto_kernel_selection` toggle in `ipex.optimize()` to switch, and you can disable `weights_prepack` in `ipex.optimize()` if you are more concerned about the memory footprint than performance gain. However, in most cases, we recommend sticking with the default settings for the best experience. + + +Optimizer Optimization +---------------------- + +Optimizers are one of key parts of the training workloads. Intel® Extension for PyTorch* brings two types of optimizations to optimizers: + +1. Operator fusion for the computation in the optimizers. +2. SplitSGD for BF16 training, which reduces the memory footprint of the master weights by half. + + +For details, refer to `Optimizer Fusion `_ and `Split SGD `_ + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/optimizer_fusion + features/split_sgd + +Runtime Extension +----------------- + +Intel® Extension for PyTorch* Runtime Extension provides PyTorch frontend APIs for users to get finer-grained control of the thread runtime and provides: + +- Multi-stream inference via the Python frontend module MultiStreamModule. +- Spawn asynchronous tasks from both Python and C++ frontend. +- Program core bindings for OpenMP threads from both Python and C++ frontend. + +.. note:: Intel® Extension for PyTorch* Runtime extension is still in the prototype stage. The API is subject to change. More detailed descriptions are available in the `API Documentation `_. + +For more detailed information, check `Runtime Extension `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/runtime_extension + +INT8 Quantization +----------------- + +Intel® Extension for PyTorch* provides built-in quantization recipes to deliver good statistical accuracy for most popular DL workloads including CNN, NLP and recommendation models. + +Users are always recommended to try quantization with the built-in quantization recipe first with Intel® Extension for PyTorch* quantization APIs. For even higher accuracy demandings, users can try with separate `recipe tuning APIs `_. The APIs are powered by Intel® Neural Compressor to take advantage of its tuning feature. + +Smooth quantization (SmoothQuant) is a more recent post-training quantization (PTQ) solution which tackles the quantization error problem caused by systematic outliers in activations. SmoothQuant is commonly used for LLM quantization, and Intel® Extension for PyTorch* has provided built-in support for this solution. + +Check more detailed information for `INT8 Quantization `_ and `INT8 recipe tuning API guide (Prototype) `_. In addition, SmoothQuant specific argument introduction and examples can be checked in `SmoothQuant recipe tuning API guide (Prototype) `_. + +.. 
toctree:: + :hidden: + :maxdepth: 1 + + features/int8_overview + features/int8_recipe_tuning_api + features/sq_recipe_tuning_api + +Codeless Optimization (Prototype, *NEW feature from 1.13.0*) +--------------------------------------------------------------- + +This feature enables users to get performance benefits from Intel® Extension for PyTorch* without changing Python scripts. It hopefully eases the usage and has been verified working well with broad scope of models, though in few cases there could be small overhead comparing to applying optimizations with Intel® Extension for PyTorch* APIs. + +For more detailed information, check `Codeless Optimization `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/codeless_optimization.md + +Graph Capture (Prototype, *NEW feature from 1.13.0*) +------------------------------------------------------- + +Since graph mode is key for deployment performance, this feature automatically captures graphs based on set of technologies that PyTorch supports, such as TorchScript and TorchDynamo. Users won't need to learn and try different PyTorch APIs to capture graphs, instead, they can turn on a new boolean flag `--graph_mode` (default off) in `ipex.optimize()` to get the best of graph optimization. + +For more detailed information, check `Graph Capture `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/graph_capture + +HyperTune (Prototype, *NEW feature from 1.13.0*) +--------------------------------------------------- + +HyperTune is an prototype feature to perform hyperparameter/execution configuration searching. The searching is used in various areas such as optimization of hyperparameters of deep learning models. The searching is extremely useful in real situations when the number of hyperparameters, including configuration of script execution, and their search spaces are huge that manually tuning these hyperparameters/configuration is impractical and time consuming. Hypertune automates this process of execution configuration searching for the `launcher `_ and Intel® Extension for PyTorch*. + +For more detailed information, check `HyperTune `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/hypertune + +Fast BERT Optimization (Prototype, *NEW feature from 2.0.0*) +--------------------------------------------------------------- + +Intel proposed a technique to speed up BERT workloads. Implementation is integrated into Intel® Extension for PyTorch\*. An API `ipex.fast_bert()` is provided for a simple usage. + +Currently `ipex.fast_bert` API is well optimized for training tasks. It works for inference tasks, though, please use the `ipex.optimize` API with graph mode to achieve the peak performance. + +For more detailed information, check `Fast BERT `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/fast_bert diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/amp.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/amp.md.txt new file mode 100644 index 000000000..836852ef2 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/amp.md.txt @@ -0,0 +1,104 @@ +Auto Mixed Precision (AMP) +========================== + +## Introduction + +`torch.cpu.amp` provides convenience for auto data type conversion at runtime. Deep learning workloads can benefit from lower-precision floating point data types such as `torch.float16` or `torch.bfloat16`, because of its lighter calculation workload and smaller memory usage. 
Accuracy is sacrificed when using lower-precision floating point data types so there's a trade-off between accuracy and performance. Thus, some operations should use the slower but more accurate`torch.float32`, while others can be converted to use the faster but less accurate `torch.float16` data type. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators. + +`torch.cpu.amp` only supports `torch.bfloat16`. It is the default lower precision floating point data type when `torch.cpu.amp` is enabled. `torch.cpu.amp` primarily benefits when running on Intel CPU with BFloat16 instruction set support. + +Prefer to use `torch.cpu.amp.autocast()` instead of `torch.autocast(device_name="cpu")`. + +## Use Case + +The following simple network should show a speedup with mixed precision. + +``` +class SimpleNet(torch.nn.Module): + def __init__(self): + super(SimpleNet, self).__init__() + self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False) + + def forward(self, x): + return self.conv(x) +``` + +### Default Precision + +Without `torch.cpu.amp`, the network executes all operators with default precision (`torch.float32`). +``` +model = SimpleNet() +x = torch.rand(64, 64, 224, 224) +y = model(x) +``` + +### Inference with Eager Path + +`torch.cpu.amp.autocast` is designed to be a context manager that allow scopes of your script to run with mixed precision. In these scopes, operations run in a data type chosen by the `autocast` class to improve performance while maintaining accuracy. See the operations category section for details on what precision the `autocast` class chooses for each operator, and under what circumstances. + +``` +model = SimpleNet().eval() +x = torch.rand(64, 64, 224, 224) +with torch.cpu.amp.autocast(): + y = model(x) +``` + +### Inference with TorchScript Path + +`torch.cpu.amp.autocast` can be used with `torch.jit.trace` to apply graph optimization. Due to PyTorch limitation, only `torch.jit.trace` is supported. + +``` +model = SimpleNet().eval() +x = torch.rand(64, 64, 224, 224) +with torch.cpu.amp.autocast(): + model = torch.jit.trace(model, x) + model = torch.jit.freeze(model) + y = model(x) +``` + +### Training Support + +`torch.cpu.amp.autocast` can be used in training to improve performance. + +``` +model = SimpleNet() +optimizer = torch.optim.SGD(model.parameters(), lr=0.001) +for images, label in train_loader(): + with torch.cpu.amp.autocast(): + loss = criterion(model(images), label) + loss.backward() + optimizer.step() +``` + +## Autocast Op Reference + +### Op Eligibility + +Ops that run in `float64` or non-floating-point dtypes are not eligible for mixed precision, and will run in these types whether or not autocast is enabled. + +Only out-of-place ops and Tensor methods are eligible for mixed precision. In-place variants and calls that explicitly supply an `out=...` Tensor +are allowed in autocast-enabled regions, but won't go through autocasting. For example, in an autocast-enabled region `a.addmm(b, c)` can autocast, but `a.addmm_(b, c)` and `a.addmm(b, c, out=d)` cannot. For best performance and stability, use out-of-place ops in autocast-enabled regions. + +### Op-Specific Behavior + +The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a `torch.nn.Module`, as a function, or as a `torch.Tensor` method. 
If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace. + +Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they're downstream from autocasted ops. + +If an op is unlisted, we assume it's numerically stable in `bfloat16`. If you believe that an unlisted op is numerically unstable in `bfloat16`, file a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues). + +#### Ops that can autocast to `bfloat16` + +`conv1d`, `conv2d`, `conv3d`, `conv_transpose1d`, `conv_transpose2d`, `conv_transpose3d`, `bmm`, `mm`, `baddbmm`, `addmm`, `addbmm`, `linear`, `matmul`, `conv_tbc`, `group_norm`, `_native_multi_head_attention` + +#### Ops that can autocast to `float32` + +`avg_pool3d`, `binary_cross_entropy`, `grid_sampler`, `polar`, `prod`, `quantile`, `nanquantile`, `stft`, `cdist`, `trace`, `view_as_complex`, `cholesky`, `cholesky_inverse`, `cholesky_solve`, `inverse`, `lu_solve`, `matrix_rank`, `orgqr`, `ormqr`, `pinverse`, `max_unpool2d`, `max_unpool3d`, `adaptive_avg_pool3d`, `reflection_pad1d`, `reflection_pad2d`, `replication_pad1d`, `replication_pad2d`, `replication_pad3d`, `mse_loss`, `cosine_embedding_loss`, `nll_loss`, `nll_loss2d`, `hinge_embedding_loss`, `poisson_nll_loss`, `smooth_l1_loss`, `cross_entropy_loss`, `l1_loss`, `huber_loss`, `margin_ranking_loss`, `soft_margin_loss`, `triplet_margin_loss`, `multi_margin_loss`, `ctc_loss`, `kl_div`, `multilabel_margin_loss`, `binary_cross_entropy_with_logits`, `fft_fft`, `fft_ifft`, `fft_fft2`, `fft_ifft2`, `fft_fftn`, `fft_ifftn`, `fft_rfft`, `fft_irfft`, `fft_rfft2`, `fft_irfft2`, `fft_rfftn`, `fft_irfftn`, `fft_hfft`, `fft_ihfft`, `linalg_cond`, `linalg_matrix_rank`, `linalg_solve`, `linalg_cholesky`, `linalg_svdvals`, `linalg_eigvals`, `linalg_eigvalsh`, `linalg_inv`, `linalg_householder_product`, `linalg_tensorinv`, `linalg_tensorsolve`, `fake_quantize_per_tensor_affine`, `eig`, `geqrf`, `lstsq`, `_lu_with_info`, `qr`, `svd`, `symeig`, `triangular_solve`, `fractional_max_pool2d`, `fractional_max_pool3d`, `adaptive_max_pool3d`, `multilabel_margin_loss_forward`, `linalg_qr`, `linalg_cholesky_ex`, `linalg_svd`, `linalg_eig`, `linalg_eigh`, `linalg_lstsq`, `linalg_inv_ex` + +#### Ops that promote to the widest input type + +These ops don't require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are `bfloat16`, the op runs in `bfloat16`. If any of the inputs is `float32`, autocast casts all inputs to `float32` and runs the op in `float32`. + +`cat`, `stack`, `index_copy` + +Some ops not listed here (e.g., binary ops like `add`) natively promote inputs without autocasting's intervention. If inputs are a mixture of `bfloat16` and `float32`, these ops run in `float32` and produce `float32` output, regardless of whether autocast is enabled. diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/auto_channels_last.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/auto_channels_last.md.txt new file mode 100644 index 000000000..5a5632160 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/auto_channels_last.md.txt @@ -0,0 +1,29 @@ +Auto Channels Last +================== + +Channels last memory format is known to have performance advantage over channels first memory format. Refer to [Channels Last](./nhwc.md) for details. 
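+
+For reference, the manual conversion that this automatic feature replaces is plain PyTorch code; a minimal sketch with a hypothetical ResNet-50 workload:
+
+```python
+import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT").eval()
+data = torch.rand(1, 3, 224, 224)
+
+# Manual conversion with stock PyTorch APIs: both the model and the input
+# are moved to the channels_last (NHWC) memory format.
+model = model.to(memory_format=torch.channels_last)
+data = data.to(memory_format=torch.channels_last)
+
+with torch.no_grad():
+    output = model(data)
+print(output.shape)
+```
+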
+Intel® Extension for PyTorch\* automatically converts the model to channels last memory format by default when users optimize their model with `ipex.optimize(model)`. + +## Ease-of-use auto channels last API +#### default +```python +model = ipex.optimize(model) # by default, model is channels last +``` + +#### enable +```python +ipex.enable_auto_channels_last() +model = ipex.optimize(model) # enable, model is channels last +``` + +#### disable +```python +ipex.disable_auto_channels_last() +model = ipex.optimize(model) # disable, model is channels first +``` + +## Known issue +For broad models, channels last memory format brings performance boost over channels first memory format. However, for few use cases, this may bring performance regression. If performance regression is observed, we recommend to feed sample input data to `ipex.optimize(model, sample_input=...)`. +```python +model = ipex.optimize(model, sample_input=...) +``` diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/codeless_optimization.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/codeless_optimization.md.txt new file mode 100644 index 000000000..a7bbbda80 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/codeless_optimization.md.txt @@ -0,0 +1,106 @@ +Codeless Optimization (Prototype) +==================================== + +This feature aims to get inference performance benefits from Intel® Extension for PyTorch\* without changing code in your python scripts, which can raise Out-of-Box (OOB) experience to get started with Intel® Extension for PyTorch\* easily. Users who already known how to apply optimizations with Intel® Extension for PyTorch\* APIs are not targeted for this feature, due to the inevitable overhead and limitations we mentioned below. + +## Motivation + +A typical use case of inference as in [transformer](https://github.com/huggingface/transformers/blob/v4.21.1/src/transformers/trainer.py#L3187) can be simplified as the code snippet below: + +``` +import torch +model = Model().eval() +with torch.no_grad(): + for input in dataloader(): + model(**input) +``` + +To utilize optimizations of Intel® Extension for PyTorch\* for optimum performance, several lines code changes are required/recommended. + +``` +import torch +import intel_extension_for_pytorch as ipex # clause added +model = Model().eval() +model = ipex.optimization(model) # clause added +with torch.no_grad(): + with torch.cpu.amp.autocast(): # clause added for running with BFloat16 (Optional) + input = ... # clause added for TorchScript (Optional, but recommended) + model = torch.jit.trace(input) # clause added for TorchScript (Optional, but recommended) + model = torch.jit.freeze() # clause added for TorchScript (Optional, but recommended) + for input in dataloader(): + model(**input) +``` + +With this feature, code changes above done manually are not required any more. Intel® Extension for PyTorch\* optimizations will be applied automatically during execution in a monkey patch way. +* Automatically import `intel_extension_for_pytorch` package: It applies Intel® Extension for PyTorch\* optimizations, such as: `torch.embedding_bag`, `torch.cpu.amp.autocast`. It also registers Intel® Extension for PyTorch\* JIT fusion pass and thus benefits the graph mode inference performance. +* Automatically apply `ipex.optimize()` function. Only features enabled by default parameter values are supported, such as: + * Auto generate FX or Jit Graph. + * Auto Channel Last convert. + * Conv-Bn folding. + * Weight prepack. 
+ * Replace dropout with identity. + * Optimize LSTM. +* Automatically apply `torch.cpu.amp.autocast` with BFloat16 data type for inference. + +## Example Usage with HuggingFace +Let's take the [QA case](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) in HuggingFace as an example. + +### The origin command with ipex launch +Here is the command to run with [`ipexrun`](../performance_tuning/launch_script.md). +``` +clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/ +``` + +### Command to apply ipex optimization for FP32 +Added `--auto-ipex` +``` +clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/ +``` + +### Command to apply ipex optimization for BF16 +Added `--auto-ipex --dtype bfloat16` +``` +clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex --dtype bfloat16 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/ +``` + +## Use Case not supported +### Module uses forward method explicitly instead of the `__call__` attr +``` +import torch +class DummyModule(torch.nn.Module): + def __init__(self,): + super(DummyModule, self).__init__() + self.input1 = torch.randn(1, 3, 224, 224) + self.conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3)) + self.bn = torch.nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) + + def forward(self, x): + return self.bn(self.conv(x)) + + def customized_forward(self, x): + return self.bn(self.conv(x)) + +# Method1 will success +DummyModule()(input) +# Method2 will fail to apply ipex.optimize in the top-level model +DummyModule().customized_forward(input) +``` +If a model uses forward method explicitly instead of the `__call__` attr, we are unable to hook the execution of this model. As result, we are unable to auto apply the optimizations to this `DummyModule()`. + +### Already using `ipex.optimize` +User already invokes `ipex.optimize` in script is not targeted for this feature. The behaviour as repeated invoking of `ipex.optimize` is not defined. The second invoking of `ipex.optimize` for the same module will fail with error message to avoid this behaviour. + +### Already using Jit Trace +For Jit trace case (as below example code) is not planned to support at first stage: +``` +import torch +model = Model().eval() +traced_model = torch.jit.trace(model, x).eval() +traced_model = torch.jit.freeze(traced_model) +with torch.no_grad(): + for input in dataloader(): + traced_model(input) +``` +For 2 reasons: +* The auto graph mode support has already been included in `ipex.optimize` with graph first API in 1.13. +* Extra launch parameters and Monkey patches are needed to support above case. We will focus on the feasibility of first use case in TorchVision and HuggingFace workloads. 
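+
+When a workload falls into one of the unsupported cases above, applying Intel® Extension for PyTorch\* explicitly remains available. A minimal sketch with a hypothetical ResNet-50 workload, where tracing happens only after `ipex.optimize` has been applied:
+
+```python
+import torch
+import torchvision.models as models
+import intel_extension_for_pytorch as ipex
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT").eval()
+x = torch.rand(1, 3, 224, 224)
+
+# Optimize the eager model first, then trace and freeze it explicitly.
+model = ipex.optimize(model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(model, x)
+    traced_model = torch.jit.freeze(traced_model)
+    traced_model(x)
+```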
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/fast_bert.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/fast_bert.md.txt new file mode 100644 index 000000000..d0621131f --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/fast_bert.md.txt @@ -0,0 +1,43 @@ +Fast BERT (Prototype) +======================== + +### Feature Description + +Intel proposed a technique to speed up BERT workloads. Implementation leverages the idea from [*Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads*](https://arxiv.org/pdf/2104.05755.pdf). + +Currently `ipex.fast_bert` API is only well optimized for training. For inference, it ensures functionality, while to get peak perf, please use `ipex.optimize` API + torchscript. + +### Prerequisite + +- Transformers 4.6.0 ~ 4.43.2 + +### Usage Example + +An API `ipex.fast_bert` is provided for a simple usage. Usage of this API follows the pattern of `ipex.optimize` function. More detailed description of API is available at [Fast BERT API doc](../api_doc) + +[//]: # (marker_feature_fastbert_bf16) +```python +import torch +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) +torch.manual_seed(43) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.fast_bert(model, dtype=torch.bfloat16) +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_feature_fastbert_bf16) diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/graph_capture.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/graph_capture.md.txt new file mode 100644 index 000000000..c0f1d66c7 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/graph_capture.md.txt @@ -0,0 +1,30 @@ +Graph Capture (Prototype) +============================ + +### Feature Description + +This feature automatically applies a combination of TorchScript trace technique and TorchDynamo to try to generate a graph model, for providing a good user experience while keeping execution fast. Specifically, the process tries to generate a graph with TorchScript trace functionality first. In case of generation failure or incorrect results detected, it changes to TorchDynamo with TorchScript backend. Failure of the graph generation with TorchDynamo triggers a warning message. Meanwhile the generated graph model falls back to the original one. I.e. the inference workload runs in eager mode. Users can take advantage of this feature through a new knob `--graph_mode` of the `ipex.optimize()` function to automatically run into graph mode. 
+ +### Usage Example + +[//]: # (marker_feature_graph_capture) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +#################### code changes #################### # noqa F401 +import intel_extension_for_pytorch as ipex + +model = ipex.optimize(model, graph_mode=True) +###################################################### # noqa F401 + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_feature_graph_capture) diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/graph_optimization.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/graph_optimization.md.txt new file mode 100644 index 000000000..089432948 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/graph_optimization.md.txt @@ -0,0 +1,233 @@ +Graph Optimization +================== + +Most Deep Learning models could be described as a DAG (directed acyclic graph). Optimizing a deep learning model from a graph perspective is straight forward. Compared to the operator optimization and algorithm optimization, the graph optimization is at a higher level. It covers not only the graph but also the runtime. From the operator perspective, the graph optimization contains the operator fusing and constant folding. From the runtime perspective, the graph optimization contains the operator scheduling, computation resources management, and memory management. + +The Intel® Extension for PyTorch\* focuses on operator related graph optimizations. The extension also provides some prototype features for the related runtime optimizations. Refer to the runtime extension for more details about runtime optimization. + +## Ease-of-use graph optimization API +The graph optimizations of Intel® Extension for PyTorch\* are enabled by default. Users can disable it by calling: +``` +ipex.enable_onednn_fusion(False) +``` + +### FP32 and BF16 models + +[//]: # (marker_feature_graph_optimization_fp32_bf16) +```python +import torch +import torchvision.models as models + +# Import the Intel Extension for PyTorch +import intel_extension_for_pytorch as ipex + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() + +# Apply some fusions at the front end +model = ipex.optimize(model, dtype=torch.float32) + +x = torch.randn(4, 3, 224, 224) +with torch.no_grad(): + model = torch.jit.trace(model, x, check_trace=False).eval() + # Fold the BatchNormalization and propagate constant + torch.jit.freeze(model) + # Print the graph + print(model.graph_for(x)) + +print("Execution finished") +``` +[//]: # (marker_feature_graph_optimization_fp32_bf16) + +Compared to the original code, the model launcher needs to add a few lines of code and the extension will automatically accelerate the model. Regarding the RN50, the extension will automatically fuse the Conv + ReLU and Conv + Sum + ReLU as ConvReLU and ConvSumReLU. If you check the output of `graph_for`, you will observe the fused operators. 
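+
+To see the effect of the fusion pass, you can optionally disable it and print the graph again for comparison. A minimal sketch reusing the ResNet-50 example above; only the `enable_onednn_fusion` calls differ:
+
+```python
+import torch
+import torchvision.models as models
+import intel_extension_for_pytorch as ipex
+
+# Disable the oneDNN fusion pass so the traced graph is left unfused
+ipex.enable_onednn_fusion(False)
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+model = ipex.optimize(model, dtype=torch.float32)
+x = torch.randn(4, 3, 224, 224)
+with torch.no_grad():
+    model = torch.jit.trace(model, x, check_trace=False).eval()
+    torch.jit.freeze(model)
+    # Compare this graph against the fused one printed above
+    print(model.graph_for(x))
+
+# Restore the default behavior for subsequent models
+ipex.enable_onednn_fusion(True)
+```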
+ +### INT8 models + +[//]: # (marker_feature_graph_optimization_int8) +```python +import torch +import torchvision.models as models +import intel_extension_for_pytorch as ipex +from intel_extension_for_pytorch.quantization import prepare, convert + +# construct the model +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +qconfig = ipex.quantization.default_static_qconfig +model.eval() +example_inputs = torch.rand(1, 3, 224, 224) +prepared_model = prepare(model, qconfig, example_inputs=example_inputs, inplace=False) + +##### Example Dataloader ##### # noqa F401 +import torchvision + +DOWNLOAD = True +DATA = "datasets/cifar10/" + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +calibration_data_loader = torch.utils.data.DataLoader( + dataset=train_dataset, batch_size=128 +) + +with torch.no_grad(): + for batch_idx, (d, target) in enumerate(calibration_data_loader): + print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}") + prepared_model(d) +############################## # noqa F401 + +convert_model = convert(prepared_model) +with torch.no_grad(): + traced_model = torch.jit.trace(convert_model, example_inputs) + traced_model = torch.jit.freeze(traced_model) + +traced_model.save("quantized_model.pt") + +# Deployment +quantized_model = torch.jit.load("quantized_model.pt") +quantized_model = torch.jit.freeze(quantized_model.eval()) +images = torch.rand(1, 3, 244, 244) +with torch.no_grad(): + output = quantized_model(images) + +print("Execution finished") +``` +[//]: # (marker_feature_graph_optimization_int8) + +## Methodology +### Fusion +#### FP32 and BF16 fusion patterns +- Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Abs/Clamp/Elu/Exp/GELU/HardTanh/HardSwish/Log/Mish/Sigmoid/Pow/ReLU/Round/Sqrt/Square/Tanh/Leaky_ReLU/SiLU +- Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Sigmoid + MUL +- Conv1D/Conv2D/Conv3D/Linear + SUM +- Conv1D/Conv2D/Conv3D + SUM + ReLU +- Add + LayerNorm +- Div + Add + Softmax +- Linear + Linear + Linear +- View + Transpose + Contiguous + View + +#### INT8 fusion patterns +The `ipex.quantization.convert(model, conf, inputs)` API will convert an FP32 `torch.nn.Module` to a quantized JIT ScriptModule according to the given quantization recipes. + +For example, for a FP32 model of one single convolution, the graph before and after conversion will be: +![image](../../../images/graph_optimization/int8_pattern.png) + +The oneDNN graph backend will select `dequantize` and `convolution` into one partition. During execution, this partition will execute a convolution with int8 as input and fp32 as output. + +Here listed all the currently supported int8 patterns in Intel® Extension for PyTorch\* using oneDNN graph backend: + +1. Conv/Linear/Matmul related fusion patterns + ``` + | + [Quantize]* + | | + Dequantize Dequantize + \ / + Conv1D/Conv2D/Conv3D/Linear/MatMul + | + [Abs/Elu/GELU/HardTanh/Leaky_ReLU/Sigmoid/ + ReLU/Sqrt/Square/Tanh/[Dequantize+Add]*[0,1] ]*[0,3] + | + [Quantize]* + | + ``` + + ``` + | | + Dequantize Dequantize + \___ ___/ + MatMul + \ / + Divide + \ / + [Add]* + | + ``` + +2. Non-Conv/Linear/Matmul related fusion patterns + ``` + | + Dequantize + | + MaxPool2D + | + Quantize + ``` +3. 
INT8-BF16 mixed-precision fusion patterns + ``` + | | + Dequantize Dequantize + | | + To To + \___ ___/ + MatMul + \ / + [Divide]* + \ / + [Add]* + | + ``` + + ``` + | | + Dequantize Dequantize + | | + To To + \___ ___/ + MatMul + | + [GeLU]* + | + To + | + Quantize + | + ``` + + ``` + | | + Dequantize Dequantize + | | + To To Dequantize + \___ ___/ | + MatMul To + \_____ ___/ + [Add]* + | + ``` + + +### Folding +Stock PyTorch provids constant propagation and BatchNormalization folding. These optimizations are automatically applied to the jit model by invoking `torch.jit.freeze`. Take the Resnet50 as an example: + +[//]: # (marker_feature_graph_optimization_folding) +```python +import torch +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +x = torch.randn(4, 3, 224, 224) + +with torch.no_grad(): + model = torch.jit.trace(model, x, check_trace=False).eval() + # Fold the BatchNormalization and propagate constant + torch.jit.freeze(model) + # Print the graph + print(model.graph_for(x)) + +print("Execution finished") +``` +[//]: # (marker_feature_graph_optimization_folding) + +If the model owner does not invoke the `torch.jit.freeze`, the `BatchNormalization` still exists on the graph. Otheriwse, the `BatchNormalization` will be folded on the graph to save the compuation and then improve the performance. Refer to the [Constant Folding Wikipedia page](https://en.wikipedia.org/wiki/Constant_folding) for more details. diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/hypertune.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/hypertune.md.txt new file mode 100644 index 000000000..e8bc828b8 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/hypertune.md.txt @@ -0,0 +1,120 @@ +HyperTune (Prototype) +======================== + +![HyperTune](../../../images/hypertune/hypertune.png) + +HyperTune is an prototype feature to perform hyperparameter/execution configuration searching. The searching is used in various areas such as optimization of hyperparameters of deep learning models. The searching is extremely useful in real situations when the number of hyperparameters, including configuration of script execution, and their search spaces are huge that manually tuning these hyperparameters/configuration is impractical and time consuming. Hypertune automates this process of execution configuration searching for the [launcher](../performance_tuning/launch_script.md) and Intel® Extension for PyTorch\*. + +## Usage of Hypertune +``` +python -m intel_extension_for_pytorch.cpu.hypertune --conf-file [args] +``` + +There are two things to provide Hypertune (1) `` .yaml file to define the hyperparameters and their search spaces (2) `` as an optimization function. + +### `your_conf_file` +The .yaml file is used to define configuration of Hypertune. There are two main information needed: (1) hyperparameters to tune and their search spaces (2) tuning strategy. See comments below together with a sample .yaml file. + +``` +tuning: # optional. + strategy: grid # optional. The tuning strategy. Default is grid. Must be one of {grid, random}. + max_trials: 100 # optional. Allowed number of trials. Default is 100. If given time, set max_trials to product of length of all search spaces to try all possible combinations of hyperparameters. + +output_dir: /path/to/saving/directory # optional. Directory to which the tuning history will be saved in record.csv file. Default is current working directory. + +hyperparams: # mandatory. 
+ launcher: # optional. + hp: ['ncores_per_instance', 'ninstances'] # mandatory. Mandatory if hyperparams.launcher is specified. Specify the launcher hyperparameters to tune. + ncores_per_instance: all_physical_cores # optional. Search space of ncores_per_instance if chosen to tune. If not defined, default search space of ncore_per_instance is used. + ninstances: [1] # optional. Search space of ninstances if chosen to tune. If not defined, default search space of ninstances is used. +``` + +### Hyperparameters +#### Launcher Hyperparameters +Currently hypertune tunes for the following launcher hyperparameters: + +| hyperparameter | default value | default search space | search space format | +| :-- | :--: | :--: | :--: | +| ```ncores_per_instance``` | -1 | `all_logical_cores` | `str or list of int. str must be one of {'all_logical_cores', 'all_physical_cores'}` | +| ```ninstances``` | -1 | `all_logical_cores` | `str or list of int. str must be one of {'all_logical_cores', 'all_physical_cores'}` | +| ```use_all_nodes``` | True | `[True, False] if num_nodes > 1 else [True]` | `list of bool` | +| ```use_logical_cores``` | False | `[True, False] if is_hyperthreading_enabled else [False]` | `list of bool` | +| ```disable_numactl``` | False | `[True, False]` | `list of bool` | +| ```disable_iomp``` | False | `[True, False]` | `list of bool` | +| ```malloc``` | tc | `['tc', 'je', 'pt']` | `list of str. str must be in {'tc', 'je', 'pt'}` | + +### Defining hyperparameters and their search spaces +#### 1. Defining hyperparameters to tune: + +List the hyperparameters to tune in `hp`. For example, to tune all launcher hyperparameters: +``` +hyperparams: + launcher: + hp: ['ncores_per_instance', 'ninstances', 'use_all_nodes', 'use_logical_cores', 'disable_numactl', 'disable_iomp', 'malloc'] +``` + +For example, to tune only launcher `ncores_per_instance`: +``` +hyperparams: + launcher: + hp: ['ncores_per_instance'] +``` +All other launcher hyperparameters (`ninstances`, `use_all_nodes`, `use_logical_core`, `disable_numactl`, `disable_iomp`, `malloc`) will not be tuned and instead will use the default value defined in the previous section. + +#### 2. Defining the search spaces of the hyperparameters: + +#### Default search space + +If you don't specify the search space of a hyperparamter, then the default search space defined in the previous section will be used for the hyperparameters defined in `hp`. For example, +``` +hyperparams: + launcher: + hp: ['malloc'] +``` +`malloc` will be tuned using its default search space, `['tc', 'je', 'pt']`. All other launcher hyperparamters (`ncores_per_instance`, `ninstances`, `use_all_nodes`, `use_logical_cores`, `disable_numactl`, `disable_iomp`) will not be tuned and instead will use their default values. + +#### User defined search space + +Specify the search space of a hyperparameter. For example, +``` +hyperparams: + launcher: + hp: ['ncores_per_instance', 'ninstances', 'malloc'] + ninstances: [1] + ncore_per_instance: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] +``` +`ninstances` and `ncores_per_instance` will use user defined spaces `[1]` and `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]` respectively. `malloc` will use its default search space, `['tc', 'je', 'pt']`. + +### `` +This is the script as an optimization function. +- Step 1. Print the objective(s) you want to optimize. Make sure this is just an int or float to be minimized or maximized. +- Step 2. 
Just before the objective(s), add print statement(s) of the form `@hypertune {'name': str, 'higher_is_better': bool, 'target_val': int or float}`.
+```
+'name'             # mandatory. The name of your objective function.
+'higher_is_better' # optional. True if the objective function is to be maximized, False if it is to be minimized. Default is False.
+'target_val'       # optional. Target value of the objective function. Default is -float('inf').
+```
+
+Have a look at the [example script](https://github.com/intel/intel-extension-for-pytorch/tree/v2.0.100+cpu/intel_extension_for_pytorch/cpu/hypertune/example/resnet50.py).
+
+## Usage Examples
+
+**Tuning `ncores_per_instance` for minimum `latency`**
+
+Suppose we want to tune `ncores_per_instance` for a single instance to minimize latency for resnet50 on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. Each socket has 28 physical cores and another 28 logical cores.
+
+Run the following command with [example.yaml](https://github.com/intel/intel-extension-for-pytorch/tree/v2.0.100+cpu/intel_extension_for_pytorch/cpu/hypertune/example/example.yaml) and [resnet50.py](https://github.com/intel/intel-extension-for-pytorch/tree/v2.0.100+cpu/intel_extension_for_pytorch/cpu/hypertune/example/resnet50.py):
+```
+python -m intel_extension_for_pytorch.cpu.hypertune --conf_file /example/example.yaml /example/resnet50.py
+```
+
+Once the search completes, the best tuning result and the best configuration found are printed to the terminal. Below is the output for this example:
+```
+Best configuration found is: {'ncores_per_instance': 15, 'ninstances': 1, 'use_all_nodes': True, 'use_logical_cores': False, 'disable_numactl': False, 'disable_iomp': False, 'malloc': 'tc'}
+latency: 12.339081764221191
+```
+15 `ncores_per_instance` gave the minimum latency.
+
+You will also find the tuning history in `record.csv` under the configured output directory. You can take [a sample csv file](https://github.com/intel/intel-extension-for-pytorch/tree/v2.0.100+cpu/intel_extension_for_pytorch/cpu/hypertune/example/record.csv) as a reference.
+
+Hypertune can also optimize multiple objective functions. Add as many objectives as you would like to your script.
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/int8_overview.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/int8_overview.md.txt new file mode 100644 index 000000000..b0d279650 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/int8_overview.md.txt @@ -0,0 +1,146 @@
+Intel® Extension for PyTorch\* optimizations for quantization
+=============================================================
+
+The quantization functionality in Intel® Extension for PyTorch\* currently only supports post-training quantization. This tutorial introduces how quantization works in Intel® Extension for PyTorch\*.
+
+We reuse PyTorch quantization components as much as possible, such as the PyTorch [Observer method](https://pytorch.org/docs/1.11/quantization-support.html#torch-quantization-observer). To make the quantization API easy for PyTorch users to pick up, the quantization API in Intel® Extension for PyTorch\* is very similar to the PyTorch one. Intel® Extension for PyTorch\* quantization supports a default recipe that automatically decides which operators should be quantized or not, which gives a satisfying trade-off between performance and accuracy.
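+
+At a glance, the static post-training flow is: define a qconfig, `prepare` the FP32 model, run calibration data through it, `convert`, then trace and freeze for deployment. The condensed sketch below strings these steps together; the model, qconfig choice and calibration loader are placeholders you supply, and each step is covered in detail in the sections that follow.
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+model = ...            # your FP32 torch.nn.Module in eval() mode
+example_inputs = ...   # a representative input with the same shape as real inputs
+
+qconfig = ipex.quantization.default_static_qconfig
+prepared = prepare(model, qconfig, example_inputs=example_inputs, inplace=False)
+
+for data in calibration_data_loader:   # calibration: observers record value ranges
+    prepared(data)
+
+quantized = convert(prepared)
+with torch.no_grad():
+    traced = torch.jit.trace(quantized, example_inputs)
+    traced = torch.jit.freeze(traced)
+```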
+ +## Static Quantization + +```python +import intel_extension_for_pytorch as ipex +from intel_extension_for_pytorch.quantization import prepare, convert +``` + +### Define qconfig + +Using the default qconfig(recommended): + +```python +qconfig = ipex.quantization.default_static_qconfig +# equal to +# QConfig(activation=HistogramObserver.with_args(reduce_range=False), +# weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +``` + +or define your own qconfig as: + +```python +from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig +qconfig = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8), + weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +``` + +Note: we fully use PyTorch [observer methonds](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-observer), so you can use a different PyTorch obsever methond to define the [QConfig](https://pytorch.org/docs/1.11/generated/torch.quantization.qconfig.QConfig.html). For weight observer, we only support **torch.qint8** dtype now. + +**Suggestion**: + +1. For activation observer, if using **qscheme** as **torch.per_tensor_affine**, **torch.quint8** is preferred. If using **qscheme** as **torch.per_tensor_symmetric**, **torch.qint8** is preferred. For weight observer, setting **qscheme** to **torch.per_channel_symmetric** can get a better accuracy. +2. If your CPU device doesn't support VNNI, seting the observer's **reduce_range** to **True** can get a better accuracy, such as skylake. + +### Prepare Model and Do Calibration + +```python +# prepare model, do conv+bn folding, and init model quant_state. +user_model = ... +user_model.eval() +example_inputs = .. +prepared_model = prepare(user_model, qconfig, example_inputs=example_inputs, inplace=False) + +for x in calibration_data_set: + prepared_model(x) + +# Optional, if you want to tuning(performance or accuracy), you can save the qparams as json file which +# including the quantization state, such as scales, zero points and inference dtype. +# And then you can achange the json file's settings, loading the changed json file +# to model which will override the model's original quantization's settings. +# +# prepared_model.save_qconf_summary(qconf_summary = "configure.json") +# prepared_model.load_qconf_summary(qconf_summary = "configure.json") +``` + +### Convert to Static Quantized Model and Deploy + +```python +# make sure the example_inputs's size is same as the real input's size +convert_model = convert(prepared_model) +with torch.no_grad(): + traced_model = torch.jit.trace(convert_model, example_input) + traced_model = torch.jit.freeze(traced_model) +# for inference +y = traced_model(x) + +# or save the model to deploy + +# traced_model.save("quantized_model.pt") +# quantized_model = torch.jit.load("quantized_model.pt") +# quantized_model = torch.jit.freeze(quantized_model.eval()) +# ... 
+``` + +## Dynamic Quantization + +```python +import intel_extension_for_pytorch as ipex +from intel_extension_for_pytorch.quantization import prepare, convert +``` + +### Define QConfig + +Using the default qconfig(recommended): + +```python +dynamic_qconfig = ipex.quantization.default_dynamic_qconfig +# equal to +# QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True), +# weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) +``` + +or define your own qconfig as: + +```python +from torch.ao.quantization import MinMaxObserver, PlaceholderObserver, QConfig +dynamic_qconfig = QConfig(activation = PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True), + weight = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)) +``` + +Note: For weight observer, it only supports dtype **torch.qint8**, and the qscheme can only be **torch.per_tensor_symmetric** or **torch.per_channel_symmetric**. For activation observer, it only supports dtype **torch.float**, and the **compute_dtype** can be **torch.quint8** or **torch.qint8**. + +**Suggestion**: + +1. For weight observer, setting **qscheme** to **torch.per_channel_symmetric** can get a better accuracy. +2. If your CPU device doesn't support VNNI, setting the observer's **reduce_range** to **True** can get a better accuracy, such as skylake. + +### Prepare Model + +```python +prepared_model = prepare(user_model, dynamic_qconfig, example_inputs=example_inputs) +``` + +## Convert to Dynamic Quantized Model and Deploy + +```python +# make sure the example_inputs's size is same as the real input's size +convert_model = convert(prepared_model) +# Optional: convert the model to traced model +#with torch.no_grad(): +# traced_model = torch.jit.trace(convert_model, example_input) +# traced_model = torch.jit.freeze(traced_model) + +# or save the model to deploy +# traced_model.save("quantized_model.pt") +# quantized_model = torch.jit.load("quantized_model.pt") +# quantized_model = torch.jit.freeze(quantized_model.eval()) +# ... +# for inference +y = convert_model(x) +``` + +Note: we only support the following ops to do dynamic quantization: + +- torch.nn.Linear +- torch.nn.LSTM +- torch.nn.GRU +- torch.nn.LSTMCell +- torch.nn.RNNCell +- torch.nn.GRUCell diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/int8_recipe_tuning_api.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/int8_recipe_tuning_api.md.txt new file mode 100644 index 000000000..e3ea2a698 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/int8_recipe_tuning_api.md.txt @@ -0,0 +1,222 @@ +INT8 Recipe Tuning API (Prototype) +===================================== + +This [new API](../api_doc.html#ipex.quantization.autotune) `ipex.quantization.autotune` supports INT8 recipe tuning by using Intel® Neural Compressor as the backend in Intel® Extension for PyTorch\*. In general, we provid default recipe in Intel® Extension for PyTorch\*, and we still recommend users to try out the default recipe first without bothering tuning. If the default recipe doesn't bring about desired accuracy, users can use this API to tune for a more advanced receipe. + +Users need to provide a fp32 model and some parameters required for tuning. The API will return a prepared model with tuned qconfig loaded. 
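+
+A condensed sketch of the call pattern is shown below; it mirrors the full runnable example at the end of this document. The FP32 model, calibration dataloader, validation routine and the `example_input` used for tracing are placeholders you supply.
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex
+
+def eval_func(prepared_model):
+    # run your validation set and return a single float accuracy
+    accuracy = validate(prepared_model)   # placeholder validation routine
+    return accuracy
+
+tuned_model = ipex.quantization.autotune(
+    fp32_model,                              # FP32 torch.nn.Module in eval() mode
+    calib_dataloader,                        # calibration dataloader
+    eval_func=eval_func,                     # accuracy metric that drives the tuning
+    sampling_sizes=[100],                    # calibration sampling sizes to try
+    accuracy_criterion={"relative": 0.01},   # tolerated relative accuracy drop
+    tuning_time=0,                           # tuning time budget, as in the full example below
+)
+
+# The returned model already has the tuned qconfig loaded; convert, trace and
+# freeze it as in the regular static quantization flow.
+converted = ipex.quantization.convert(tuned_model)
+with torch.no_grad():
+    traced = torch.jit.trace(converted, example_input)
+    traced = torch.jit.freeze(traced)
+```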
+ +## Usage Example +- Static Quantization +Please refer to [static_quant example](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/features/int8_recipe_tuning/imagenet_autotune.py). + +- Smooth Quantization +Please refer to [LLM SmoothQuant example](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference/single_instance/run_quantization.py). + +## Smooth Quantization Autotune +### Algorithm: Auto-tuning of $\alpha$. +SmoothQuant method aims to split the quantization difficulty of weight and activation by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically. +Currently, both layer-wise and block-wise auto-tuning methods are supported and the default option is layer-wise. +In block-wise auto-tuning, layers within one block (e.g an OPTDecoderLayer) would share the same alpha value; users could set *'do_blockwise': True* in *auto_alpha_args* to enable it. + +Our proposed method consists of 8 major steps: + +- Hook input minimum and maximum values of layers to be smoothed using register_forward_hook. +- Find a list of layers on which smoothquant could be performed. +- Generate a list of $\alpha$ values of a user-defined range and set a default $\alpha$ value. +- Calculate smoothing factor using default $\alpha$ value, adjust parameters accordingly and forward the adjusted model given an input sample. +- Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict output. +- Calculate the layer-wise/block-wise loss with respect to FP32 output, iterate the previous two steps given each $\alpha$ value and save the layer-wise/block-wise loss per alpha. +- Apply criterion on input LayerNorm op and obtain the optimal alpha values of a single input sample. +- Iterate the previous three steps over a number of input samples and save the layer-wise/block-wise optimal $\alpha$ values. + +Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha$ value of an input LayerNorm op of a transformer block. Both alpha range and criterion could be configured in auto_alpha_args. + +In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well-balanced one for the majority of models. + +### $\alpha$ Usage +There are two ways to apply smooth quantization: 1) using a fixed `alpha` for the entire model or 2) determining the `alpha` through auto-tuning. + +#### Using a fixed `alpha` +To set a fixed alpha for the entire model, users can follow this example: +```python +import intel_extension_for_pytorch as ipex +smoothquant_args: {"alpha": 0.5, "folding": True} +tuned_model = ipex.quantization.autotune( + model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args, +) +``` +`smoothquant_args` description: +"alpha": a float value. Default is 0.5. +"folding": whether to fold mul into the previous layer, where mul is required to update the input distribution during smoothing. +- True: Fold inserted `mul` into the previous layer in the model graph. IPEX will only insert `mul` for layers that can do folding. +- False: Allow inserting `mul` to update the input distribution without folding in the graph explicitly. 
IPEX (version>=2.1) will fuse inserted `mul` automatically in the backend. + +#### Determining the `alpha` through auto-tuning +Users can search for the best `alpha` at two levels: a) for the entire model, and b) for each layer/block. + +1. Auto-tune the `alpha` for the entire model +The tuning process looks for the optimal `alpha` value from a list of `alpha` values provided by the user. +> Please note that, it may use a considerable amount of time as the tuning process applies each `alpha` to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best `alpha`. +Here is an example: + +```python +import numpy as np +smoothquant_args={"alpha": numpy.arange(0.0, 1.0, 0.1).tolist()} +``` +2. Auto-tune the `alpha` for each layer/block +In this case, the tuning process searches the optimal `alpha` of each layer of the block by evaluating the loss with respect to FP32 output on a few batches of data. +Here is an example: + +```python +smoothquant_args={ + "alpha": "auto", + "auto_alpha_args"{ + "init_alpha": 0.8, # baseline alpha-value for auto-tuning + "alpha_min": 0.8, # min value of auto-tuning alpha search space + "alpha_max": 0.99, # max value of auto-tuning alpha search space + "alpha_step": 0.01, # step_size of auto-tuning alpha search space + "shared_criterion": "mean", # criterion for input LayerNorm op of a transformer block + "enable_blockwise_loss": False, # whether to enable block-wise auto-tuning + } +} +``` + +[//]: # (marker_feature_int8_autotune) +```python +import torch +from torch import nn +from torch.utils.data import DataLoader +from torchvision import datasets +from torchvision.transforms import ToTensor + +import intel_extension_for_pytorch as ipex + +######################################################################## # noqa F401 +# Reference for training portion: +# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html + +# Download training data from open datasets. +training_data = datasets.FashionMNIST( + root="data", + train=True, + download=True, + transform=ToTensor(), +) + +# Download test data from open datasets. +test_data = datasets.FashionMNIST( + root="data", + train=False, + download=True, + transform=ToTensor(), +) +batch_size = 64 + +# Create data loaders. 
+train_dataloader = DataLoader(training_data, batch_size=batch_size) +test_dataloader = DataLoader(test_data, batch_size=1) + +for X, y in test_dataloader: + print(f"Shape of X [N, C, H, W]: {X.shape}") + print(f"Shape of y: {y.shape} {y.dtype}") + break + + +# Define model +class NeuralNetwork(nn.Module): + def __init__(self): + super().__init__() + self.flatten = nn.Flatten() + self.linear_relu_stack = nn.Sequential( + nn.Linear(28 * 28, 512), + nn.ReLU(), + nn.Linear(512, 512), + nn.ReLU(), + nn.Linear(512, 10), + ) + + def forward(self, x): + x = self.flatten(x) + logits = self.linear_relu_stack(x) + return logits + + +model = NeuralNetwork() +loss_fn = nn.CrossEntropyLoss() +optimizer = torch.optim.SGD(model.parameters(), lr=1e-3) + + +def train(dataloader, model, loss_fn, optimizer): + size = len(dataloader.dataset) + model.train() + for batch, (X, y) in enumerate(dataloader): + + # Compute prediction error + pred = model(X) + loss = loss_fn(pred, y) + + # Backpropagation + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if batch % 100 == 0: + loss, current = loss.item(), batch * len(X) + print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") + + +model, optimizer = ipex.optimize(model, optimizer=optimizer) + +epochs = 5 +for t in range(epochs): + print(f"Epoch {t+1}\n-------------------------------") + train(train_dataloader, model, loss_fn, optimizer) +print("Done!") + +################################ QUANTIZE ############################## # noqa F401 +model.eval() + + +def evaluate(dataloader, model): + size = len(dataloader.dataset) + model.eval() + accuracy = 0 + with torch.no_grad(): + for X, y in dataloader: + # X, y = X.to('cpu'), y.to('cpu') + pred = model(X) + accuracy += (pred.argmax(1) == y).type(torch.float).sum().item() + accuracy /= size + return accuracy + + +######################## recipe tuning with INC ######################## # noqa F401 +def eval(prepared_model): + accu = evaluate(test_dataloader, prepared_model) + return float(accu) + + +tuned_model = ipex.quantization.autotune( + model, + test_dataloader, + eval_func=eval, + sampling_sizes=[100], + accuracy_criterion={"relative": 0.01}, + tuning_time=0, +) +######################################################################## # noqa F401 + +# run tuned model +data = torch.randn(1, 1, 28, 28) +convert_model = ipex.quantization.convert(tuned_model) +with torch.no_grad(): + traced_model = torch.jit.trace(convert_model, data) + traced_model = torch.jit.freeze(traced_model) + traced_model(data) + +# save tuned qconfig file +tuned_model.save_qconf_summary(qconf_summary="tuned_conf.json") + +print("Execution finished") +``` +[//]: # (marker_feature_int8_autotune) diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/isa_dynamic_dispatch.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/isa_dynamic_dispatch.md.txt new file mode 100644 index 000000000..b4f538dd0 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/isa_dynamic_dispatch.md.txt @@ -0,0 +1,483 @@ +ISA Dynamic Dispatching +======================= + +This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch. + +## Overview + +IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`. 
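+
+For a quick check of which level the dispatcher resolves to on a given machine, the private debug APIs described in the "Private Debug APIs" section later in this document can be used (debug-only, subject to change):
+
+```python
+import intel_extension_for_pytorch._C as core
+
+print(core._get_current_isa_level())                 # level actually dispatched to, e.g. 'AMX'
+print(core._get_highest_cpu_support_isa_level())     # highest level the CPU supports
+print(core._get_highest_binary_support_isa_level())  # highest level this binary was built with
+```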
+ +PyTorch & IPEX CPU ISA support statement: + + | | DEFAULT | AVX2 | AVX2_VNNI | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX | AVX512_FP16 + | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | + | PyTorch | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | + | IPEX-1.11 | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | + | IPEX-1.12 | ✘ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ | + | IPEX-1.13 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | + | IPEX-2.1 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | + | IPEX-2.2 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | + | IPEX-2.3 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | + +\* Current IPEX DEFAULT level implemented as same as AVX2 level. + +### CPU ISA build compiler requirement + | ISA Level | GCC requirement | + | ---- | ---- | + | AVX2 | Any | + | AVX512 | GCC 9.2+ | + | AVX512_VNNI | GCC 9.2+ | + | AVX512_BF16 | GCC 10.3+ | + | AVX2_VNNI | GCC 11.2+ | + | AMX | GCC 11.2+ | + | AVX512_FP16 | GCC 12.1+ | + +\* Check with `cmake/Modules/FindAVX.cmake` for detailed compiler checks. + +## Dynamic Dispatch Design + +Dynamic dispatch copies the kernel implementation source files to multiple folders for each ISA level. It then builds each file using its ISA specific parameters. Each generated object file will contain its function body (**Kernel Implementation**). + +Kernel Implementation uses an anonymous namespace so that different CPU versions won't conflict. + +**Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels. + +At the runtime, **Dispatch Stub implementation** will check CPUIDs and OS status to determins which ISA level pointer best matches the function body. + +### Code Folder Struct +>#### **Kernel implementation:** `csrc/cpu/aten/kernels/xyzKrnl.cpp` +>#### **Kernel Stub:** `csrc/cpu/aten/xyz.cpp` and `csrc/cpu/aten/xyz.h` +>#### **Dispatch Stub implementation:** `csrc/cpu/dyndisp/DispatchStub.cpp` and `csrc/cpu/dyndisp/DispatchStub.h` + +### CodeGen Process +IPEX build system will generate code for each ISA level with specifiy complier parameters. The CodeGen script is located at `cmake/cpu/IsaCodegen.cmake`. + +The CodeGen will copy each cpp files from **Kernel implementation**, and then add ISA level as new file suffix. 
+ +> **Sample:** +> +> ---- +> +> **Origin file:** +> +> `csrc/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp` +> +> **Generate files:** +> +> DEFAULT: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.DEFAULT.cpp -O3 -D__AVX__ -DCPU_CAPABILITY_AVX2 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=DEFAULT -DCPU_CAPABILITY_DEFAULT` +> +> AVX2: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX2.cpp -O3 -D__AVX__ -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2` +> +> AVX512: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512.cpp -O3 -D__AVX512F__ -mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512` +> +> AVX512_VNNI: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_VNNI.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mfma -DCPU_CAPABILITY=AVX512_VNNI -DCPU_CAPABILITY_AVX512_VNNI` +> +> AVX512_BF16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16` +> +> AMX: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX` +> +> AVX512_FP16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_FP16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -mavx512fp16 -DCPU_CAPABILITY_AMX -DCPU_CAPABILITY=AVX512_FP16 -DCPU_CAPABILITY_AVX512_FP16` +--- + +>**Note:** +>1. DEFAULT level kernels is not fully implemented in IPEX. In order to align to PyTorch, we build default use AVX2 parameters in stead of that. So, IPEX minimal required executing machine support AVX2. +>2. `-D__AVX__` and `-D__AVX512F__` is defined for depends library [sleef](https://sleef.org/) . +>3. `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX2` are must to be defined for **PyTorch:** `aten/src/ATen/cpu/vec`, it determins vec register width. +>4. `-DCPU_CAPABILITY=[ISA_NAME]` is must to be defined for **PyTorch:** `aten/src/ATen/cpu/vec`, it is used as inline namespace name. +>5. Higher ISA level is compatible to lower ISA levels, so it needs to contains level ISA feature definitions. Such as AVX512_BF16 need contains `-DCPU_CAPABILITY_AVX512` `-DCPU_CAPABILITY_AVX512_VNNI`. But AVX512 don't contains AVX2 definitions, due to there are different vec register width. + +## Add Custom Kernel + +If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips: + +1. Add CPU ISA related kernel implementation to the folder: `csrc/cpu/aten/kernels/NewKernelKrnl.cpp` +2. Add kernel stub to the folder: `csrc/cpu/aten/NewKernel.cpp` +3. Include header file: `csrc/cpu/dyndisp/DispatchStub.h`, and reference to the comment in the header file. 
+```c++ +// Implements instruction set specific function dispatch. +// +// Kernels that may make use of specialized instruction sets (e.g. AVX2) are +// compiled multiple times with different compiler flags (e.g. -mavx2). A +// DispatchStub contains a table of function pointers for a kernel. At runtime, +// the fastest available kernel is chosen based on the features reported by +// cpuinfo. +// +// Example: +// +// In csrc/cpu/aten/MyKernel.h: +// using fn_type = void(*)(const Tensor& x); +// IPEX_DECLARE_DISPATCH(fn_type, stub); +// +// In csrc/cpu/aten/MyKernel.cpp +// IPEX_DEFINE_DISPATCH(stub); +// +// In csrc/cpu/aten/kernels/MyKernel.cpp: +// namespace { +// // use anonymous namespace so that different cpu versions won't conflict +// void kernel(const Tensor& x) { ... } +// } +// IPEX_REGISTER_DISPATCH(stub, &kernel); +// +// To call: +// stub(kCPU, tensor); +``` +4. Write the kernel follow the guide. It contains: declare function type, register stub, call stub, etc. + +>**Note:** +> +>1. Some kernels only call **oneDNN** or **iDeep** implementation, or other backend implementation, which is not needed to add kernel implementations. (Refer: `BatchNorm.cpp`) +>2. Vec related header file must be included in kernel implementation files, but can not be included in kernel stub. Kernel stub is common code for all ISA level, and can't pass ISA related compiler parameters. +>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html). + +### ISA intrinics specific kernel example: + +This is a FP32 convert to BF16 function example, and it is implemented for `AVX512_BF16`, `AVX512` and `DEFAULT` ISA levels. + +```c++ +//csrc/cpu/aten/CvtFp32ToBf16.h + +#pragma once + +#include + +namespace torch_ipex { +namespace cpu { + +void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len); + +namespace { + +void cvt_fp32_to_bf16_kernel_impl(at::BFloat16* dst, const float* src, int len); + +} + +using cvt_fp32_to_bf16_kernel_fn = void (*)(at::BFloat16*, const float*, int); +IPEX_DECLARE_DISPATCH(cvt_fp32_to_bf16_kernel_fn, cvt_fp32_to_bf16_kernel_stub); +} // namespace cpu +} // namespace torch_ipex + +``` +```c++ +//csrc/cpu/aten/CvtFp32ToBf16.cpp + +#include "CvtFp32ToBf16.h" + +namespace torch_ipex { +namespace cpu { + +IPEX_DEFINE_DISPATCH(cvt_fp32_to_bf16_kernel_stub); + +void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) { + return cvt_fp32_to_bf16_kernel_stub(kCPU, dst, src, len); +} + +} // namespace cpu +} // namespace torch_ipex + +``` +Macro `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler check, it is means that current compiler havs capability to generate defined ISA level code. + +Because of `AVX512_BF16` is higher level than `AVX512`, and it compatible to `AVX512`. `CPU_CAPABILITY_AVX512_BF16` can be contained in `CPU_CAPABILITY_AVX512` region. +```c++ +//csrc/cpu/aten/kernels/CvtFp32ToBf16Krnl.cpp + +#include +#include "csrc/aten/cpu/CvtFp32ToBf16.h" + +namespace torch_ipex { +namespace cpu { + +namespace { + +#if defined(CPU_CAPABILITY_AVX512) +#include +#else +#include +#endif +using namespace at::vec; + +#if defined(CPU_CAPABILITY_AVX512) +#include + +inline __m256i _cvt_fp32_to_bf16(const __m512 src) { +#if (defined CPU_CAPABILITY_AVX512_BF16) // AVX512_BF16 ISA implementation. + return reinterpret_cast<__m256i>(_mm512_cvtneps_pbh(src)); +#else // AVX512 ISA implementation. 
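+  // Without native AVX512_BF16 support, the code below emulates
+  // round-to-nearest-even truncation of FP32 to BF16: it adds 0x7fff plus the
+  // lowest retained mantissa bit as a rounding bias, shifts the result right
+  // by 16 bits, and blends NaN inputs to 0xffff so they remain NaN after the
+  // narrowing.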
+ __m512i value = _mm512_castps_si512(src); + __m512i nan = _mm512_set1_epi32(0xffff); + auto mask_value = _mm512_cmp_ps_mask(src, src, _CMP_ORD_Q); + __m512i ones = _mm512_set1_epi32(0x1); + __m512i vec_bias = _mm512_set1_epi32(0x7fff); + // uint32_t lsb = (input >> 16) & 1; + auto t_value = _mm512_and_si512(_mm512_srli_epi32(value, 16), ones); + // uint32_t rounding_bias = 0x7fff + lsb; + t_value = _mm512_add_epi32(t_value, vec_bias); + // input += rounding_bias; + t_value = _mm512_add_epi32(t_value, value); + // input = input >> 16; + t_value = _mm512_srli_epi32(t_value, 16); + // Check NaN before converting back to bf16 + t_value = _mm512_mask_blend_epi32(mask_value, nan, t_value); + return _mm512_cvtusepi32_epi16(t_value); +#endif +} + +void cvt_fp32_to_bf16_kernel_impl( + at::BFloat16* dst, + const float* src, + int len) { + int i = 0; + for (; i < len - 15; i += 16) { + auto f32 = _mm512_loadu_ps(src + i); + _mm256_storeu_si256((__m256i*)(dst + i), _cvt_fp32_to_bf16(f32)); + } + if (i < len) { + auto mask = (1 << (len - i)) - 1; + auto f32 = _mm512_maskz_loadu_ps(mask, src + i); + _mm256_mask_storeu_epi16(dst + i, mask, _cvt_fp32_to_bf16(f32)); + } +} + +#else // DEFAULT ISA implementation. + +void cvt_fp32_to_bf16_kernel_impl( + at::BFloat16* dst, + const float* src, + int len) { + for (int j = 0; j < len; j++) { + *(dst + j) = *(src + j); + } +} + +#endif + +} // anonymous namespace + +IPEX_REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl); + +} // namespace cpu +} // namespace torch_ipex + +``` + +### Vec specific kernel example: +This example shows how to get the data type size and its Vec size. In different ISA, Vec has a different register width and a different Vec size. + +```c++ +//csrc/cpu/aten/GetVecLength.h +#pragma once + +#include + +namespace torch_ipex { +namespace cpu { + +std::tuple get_cpp_typesize_and_vecsize(at::ScalarType dtype); + +namespace { + +std::tuple get_cpp_typesize_and_vecsize_kernel_impl( + at::ScalarType dtype); +} + +using get_cpp_typesize_and_vecsize_kernel_fn = + std::tuple (*)(at::ScalarType); +IPEX_DECLARE_DISPATCH( + get_cpp_typesize_and_vecsize_kernel_fn, + get_cpp_typesize_and_vecsize_kernel_stub); + +} // namespace cpu +} // namespace torch_ipex + +``` + +```c++ +//csrc/cpu/aten/GetVecLength.cpp + +#include "GetVecLength.h" + +namespace torch_ipex { +namespace cpu { + +IPEX_DEFINE_DISPATCH(get_cpp_typesize_and_vecsize_kernel_stub); + +// get cpp typesize and vectorsize by at::ScalarType +std::tuple get_cpp_typesize_and_vecsize(at::ScalarType dtype) { + return get_cpp_typesize_and_vecsize_kernel_stub(kCPU, dtype); +} + +} // namespace cpu +} // namespace torch_ipex + +``` + +```c++ +//csrc/cpu/aten/kernels/GetVecLengthKrnl.cpp + +#include +#include "csrc/cpu/aten/GetVecLength.h" + +namespace torch_ipex { +namespace cpu { + +namespace { + +std::tuple get_cpp_typesize_and_vecsize_kernel_impl( + at::ScalarType dtype) { + switch (dtype) { + case at::ScalarType::Double: + return std::make_tuple( + sizeof(double), at::vec::Vectorized::size()); + case at::ScalarType::Float: + return std::make_tuple(sizeof(float), at::vec::Vectorized::size()); + case at::ScalarType::ComplexDouble: + return std::make_tuple( + sizeof(c10::complex), + at::vec::Vectorized>::size()); + case at::ScalarType::ComplexFloat: + return std::make_tuple( + sizeof(c10::complex), + at::vec::Vectorized>::size()); + case at::ScalarType::BFloat16: + return std::make_tuple( + sizeof(decltype( + c10::impl::ScalarTypeToCPPType::t)), + 
at::vec::Vectorized::t)>::size()); + case at::ScalarType::Half: + return std::make_tuple( + sizeof(decltype( + c10::impl::ScalarTypeToCPPType::t)), + at::vec::Vectorized::t)>::size()); + default: + TORCH_CHECK( + false, + "Currently only floating and complex ScalarType are supported."); + } +} + +} // anonymous namespace + +IPEX_REGISTER_DISPATCH( + get_cpp_typesize_and_vecsize_kernel_stub, + &get_cpp_typesize_and_vecsize_kernel_impl); + +} // namespace cpu +} // namespace torch_ipex + +``` +## Private Debug APIs + +Here are three ISA-related private APIs that can help debugging:: +1. Query current ISA level. +2. Query max CPU supported ISA level. +3. Query max binary supported ISA level. +>**Note:** +> +>1. Max CPU supported ISA level only depends on CPU features. +>2. Max binary supported ISA level only depends on built complier version. +>3. Current ISA level, it is the smaller of `max CPU ISA level` and `max binary ISA level`. + +### Example: +```bash +python +Python 3.9.7 (default, Sep 16 2021, 13:09:58) +[GCC 7.5.0] :: Anaconda, Inc. on linux +Type "help", "copyright", "credits" or "license" for more information. +>>> import intel_extension_for_pytorch._C as core +>>> core._get_current_isa_level() +'AMX' +>>> core._get_highest_cpu_support_isa_level() +'AMX' +>>> core._get_highest_binary_support_isa_level() +'AMX' +>>> quit() +``` + +## Select ISA level manually. + +By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`, `avx512_fp16`}. The effective ISA level would be the minimal level between `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware. +### Example: +```bash +$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' +AMX +$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())' +AVX2 +``` +>**Note:** +> +>`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purpose only and subject to change. + +## CPU feature check + +An addtional CPU feature check tool in the subfolder: `tests/cpu/isa` + +```bash +$ cmake . 
+-- The C compiler identification is GNU 11.2.1 +-- The CXX compiler identification is GNU 11.2.1 +-- Detecting C compiler ABI info +-- Detecting C compiler ABI info - done +-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped +-- Detecting C compile features +-- Detecting C compile features - done +-- Detecting CXX compiler ABI info +-- Detecting CXX compiler ABI info - done +-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped +-- Detecting CXX compile features +-- Detecting CXX compile features - done +-- Configuring done +-- Generating done +-- Build files have been written to: tests/cpu/isa +$ make +[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o +[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o +[100%] Linking CXX executable cpu_features +[100%] Built target cpu_features +$ ./cpu_features +XCR0: 00000000000602e7 +os --> avx: true +os --> avx2: true +os --> avx512: true +os --> amx: true +mmx: true +sse: true +sse2: true +sse3: true +ssse3: true +sse4_1: true +sse4_2: true +aes_ni: true +sha: true +xsave: true +fma: true +f16c: true +avx: true +avx2: true +avx_vnni: true +avx512_f: true +avx512_cd: true +avx512_pf: false +avx512_er: false +avx512_vl: true +avx512_bw: true +avx512_dq: true +avx512_ifma: true +avx512_vbmi: true +avx512_vpopcntdq: true +avx512_4fmaps: false +avx512_4vnniw: false +avx512_vbmi2: true +avx512_vpclmul: true +avx512_vnni: true +avx512_bitalg: true +avx512_fp16: true +avx512_bf16: true +avx512_vp2intersect: true +amx_bf16: true +amx_tile: true +amx_int8: true +prefetchw: true +prefetchwt1: false +``` \ No newline at end of file diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/nhwc.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/nhwc.md.txt new file mode 100644 index 000000000..21a611676 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/nhwc.md.txt @@ -0,0 +1,190 @@ +Channels Last +============= + +## What is Channels Last + +**Note**: In PyTorch, **memory format** refers to data representation that describes how multidimensional arrays (nD) are stored in linear (1D) memory address space. **Memory format** has the same semantic meaning as **layout** in oneDNN. **Layout** in PyTorch has other semantic of describing **dense** or **sparse** with the attributes: 'torch.strided', 'torch.sparse_coo'. + +On CNN models, the canonical order of tensor dimensions is assigned with semantic meaning. For example the input tensor of 2D convolution is of NCHW by default on PyTorch - . NHWC is an alternative way of describing the tensor dimensions - . + +Look at the following image of illustrating NCHW and NHWC when N=1. Actually when N=1, NHWC has the same format with BMP file image. +![fig-1-memory-layout](../../../images/channels_last/figure1_memory_layout.png) + +PyTorch refers to NCHW as `torch.contiguous_format` (the default memory format) and to NHWC as `torch.channels_last`, which is a new feature as of the 1.5 release. + +TensorFlow uses NHWC as the default memory format because NHWC has a performance advantage over NCHW. On CPU platforms, we propose to optimize Channels Last memory path for the following reasons: +* **Performance** - NHWC performance is not as good as blocked memory format (nChw16c), but it is close, and much better performance than NCHW. 
+* **User Experience** - Operator coverage of NHWC would be higher than blocked memory format (`to_mkldnn()` method), so user experience is better. To be specific, it is difficult to enable operators that manipulates `dim` on blocked format such as `sum(dim=?)`. You would need to convert tensor from blocked memory format back to NHWC using `to_dense()`, before feeding it into `sum()`. This is naturally supported on Channels Last memory format already. +* **Upstream** - Will be easier since CPU doesn't hold secret ingredient and both inference and training will be covered. + +## Memory Format Is All That Matters + +On CNN models, memory format is almost the foundation of any upper level design. One important fact is that converting memory format could be very expensive. Thus, in case that multiple CNN operators are performed in sequence, e.g. `Conv2d -> ReLU -> Conv2d`, it's beneficial to transform them from different memory formats once, do computation and reorder them back. + +On PyTorch, you can use 3 types of memory formats on CNN models: + +### a. NCHW (default) + +```python +## NB: internally blocked format will still be used. +## aka. we do 'reorder' for 'input', 'weight' and 'output', +## and believe me this is expensive, roughly 50% perf loss... +input = torch.randn(1, 10, 32, 32) +model = torch.nn.Conv2d(10, 20, 1, 1) +output = model(input) +``` + +### b. NHWC (WIP for CPU) + +```python +input = torch.randn(1, 10, 32, 32) +model = torch.nn.Conv2d(10, 20, 1, 1) +## NB: convert to Channels Last memory format. +## oneDNN supports NHWC for feature maps (input, output), +## but weight still needs to be of blocked format. +## Still we can save reorders for feature maps. +input = input.to(memory_format=torch.channels_last) +model = model.to(memory_format=torch.channels_last) +output = model(input) +``` + +### c. Blocked (nChw16c) + +```python +from torch.utils import mkldnn as mkldnn_utils +input = torch.randn(1, 10, 32, 32) +model = torch.nn.Conv2d(10, 20, 1, 1) +## NB: convert to blocked memory format. +## Note that 'output' is in blocked memory format, +## in case the subsequent operator doesn't support blocked memory format +## you need to manually reorder it back to NCHW by output.to_dense() +## mkldnn_utils.to_mkldnn(model) is used to prepack the weight, this will save weight reorder time +## for inference. For training, it is not needed. +input = input.to_mkldnn() +model = mkldnn_utils.to_mkldnn(model) +output = model(input) +``` + +Better to explain the concepts here with a diagram, the **dotted lines** indicate simple memory view, no hard copy. +![fig-2(1)-pt-conv-layout-path-dispatch](../../../images/channels_last/figure2_dispatch.png) + +**Conclusion** is that NHWC path saves the reorders from feature maps compared with NCHW path, but still weight reorder is necessary since oneDNN requires weights to be in blocked memory format. From performance perspective, when `batch_size=N`, weight reorder is minimum compared to feature map reorder. But when `batch_size=1`, weight reorder is usually not negligible. So whether to enable weight prepacking on channels last memory format needs further discussion. + +## PyTorch Strided Layout + +Before moving on, I feel it is necessary to explain how PyTorch organizes tensors in memory - the **layout**. Here we only focus on **dense** tensors, skip 'coo' layout of **sparse** tensor. 
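+
+For concreteness, the stride bookkeeping discussed below can be inspected directly from PyTorch; the sizes here are arbitrary:
+
+```python
+import torch
+
+x = torch.empty(2, 3, 4, 5)                     # N=2, C=3, H=4, W=5, default NCHW layout
+print(x.stride())                               # (60, 20, 5, 1)  ->  <CHW, HW, W, 1>
+
+y = x.to(memory_format=torch.channels_last)     # same logical NCHW sizes, NHWC in memory
+print(y.shape)                                  # torch.Size([2, 3, 4, 5])
+print(y.stride())                               # (60, 1, 15, 3)  ->  <HWC, 1, WC, C>
+```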
+
+The question itself can be reinterpreted as: for a tensor of size `<N, C, H, W>`, how does PyTorch access the element with index `<n, c, h, w>` in memory? The answer is **stride**:
+```
+tensor:  <N, C, H, W>
+index:   <n, c, h, w>
+strides: <CHW, HW, W, 1>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = CHW * n + HW * c + W * h + 1 * w
+```
+
+One merit of introducing **stride** is that it can express noncontiguous tensors, e.g. a slice of a bigger tensor. For example, the 'Xs' in the following image form a slice whose strides are those of the full tensor, so the slice itself is noncontiguous.
+
+![fig-3-pytorch-strided-layout](../../../images/channels_last/figure3_strided_layout.png)
+
+Keep in mind that a PyTorch Tensor does not have an attribute called 'memory_format'. The memory format expression completely relies on **size** and **stride**. The design principle can be found at reference: [RFC: Memory format (aka layout aka NHWC) support](https://github.com/pytorch/pytorch/issues/19092). No matter what the tensor's memory format is, we need a logical canonical order for the dimensions - that is **NCHW** on PyTorch. Thus, **size** and **stride** are ALWAYS described in the order of **NCHW**. Let's now look at the Channels Last case of the previous question:
+```
+tensor:  <N, C, H, W>
+index:   <n, c, h, w>
+strides: <HWC, 1, WC, C>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = HWC * n + 1 * c + WC * h + C * w
+```
+
+Actually, this pattern applies to ALL other memory formats as long as it is 4-dim, e.g. strides for CHWN would be `<1, HWN, WN, N>`.
+
+## PyTorch Channels Last Memory Format APIs
+
+### a. tensor creation
+```python
+x = torch.empty(N, C, H, W, memory_format=torch.channels_last)
+```
+
+### b. tensor conversion
+```python
+## .contiguous() transforms NHWC noncontiguous to NHWC contiguous.
+## .to() converts NCHW tensor to NHWC one; it is out-of-place.
+x = x.contiguous(memory_format=torch.channels_last)
+x = x.to(memory_format=torch.channels_last)
+
+## contiguous check
+x.is_contiguous(memory_format=torch.channels_last)
+```
+
+### c. model conversion
+```python
+## NB: tensor.to() is an out-of-place operation.
+## model.to() is inplace. It calls _apply() which is inplace.
+model = model.to(memory_format=torch.channels_last)
+input = input.to(memory_format=torch.channels_last)
+```
+
+### d. operator coverage
+
+Detailed operator coverage information has been listed at reference [Operators-with-Channels-Last-support](https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support). In brief, ImageNet training topologies on GPU already have full support on Channels Last memory format, while CPU doesn't.
+
+Some spontaneous questions:
+* **How to tell whether this model or operator supports Channels Last?** - This requires a manual memory format check: 'torch.channels_last' input and weight shall NOT generate 'torch.contiguous_format' output.
+* **What if the model contains an operator that doesn't support Channels Last?** - No error messages will be shown; the NHWC tensor will be handled by the operator as a non-contiguous NCHW tensor, so the result might not be correct depending on the algorithm of this operator.
+
+## Writing Channels Last Kernels
+
+### a. Status on CPU
+
+* **No support** - Requires registering a Channels Last kernel for the CPU path, e.g. Conv2d;
+* **Explicit support** - Already has a Channels Last kernel for the CPU path (in ATen native manner), need to compare oneDNN counterpart performance, e.g. BatchNorm;
+* **Implicit support** - Supported via meta structures like 'TensorIterator', need to compare oneDNN counterpart performance, e.g. ReLU.
+
+### b. 
Register Channels Last Kernel in ATen Native Manner + +The general guideline has been listed under reference [Writing-memory-format-aware-operators](https://github.com/pytorch/pytorch/wiki/Writing-memory-format-aware-operators), not to repeat here. You may take one of my recent PR [optimize upsample performance linear mode on CPU](https://github.com/pytorch/pytorch/pull/34864) as an example, which also demonstrates NHWC performance advantage over NCHW because of the ease of vectorization. + +### c. Register oneDNN Kernel on Channels Last + +Registering a oneDNN kernel under Channels Last memory format on CPU is no different from [cuDNN](https://github.com/pytorch/pytorch/pull/23861): Only very few upper level changes are needed, such as accommodate 'contiguous()' to 'contiguous(suggested_memory_format)'. The automatic reorder of oneDNN weight shall have been hidden in ideep. + +## oneDNN NHWC APIs + +Compared to NCHW interfaces, 2 parts need to be addressed on NHWC interfaces: + +### a. Create NHWC Memory + +The logical size and stride description of oneDNN is always in NCHW, this is identical to PyTorch. Example code such as +```cpp +/* create md from memory::format_tag */ +auto src_md = memory::desc( + {N, C, H, W}, // logical dims, the order is defined by a primitive + memory::data_type::f32, // tensor's data type + memory::format_tag::nhwc // memory format, NHWC in this case +); + +/* alternative: create md from strides */ +auto src_md = memory::desc( + {N, C, H, W}, // logical dims, the order is defined by a primitive + memory::data_type::f32, // tensor's data type + {stride_N, stride_C, stride_H, stride_W} // the strides +); + +/* create memory */ +auto src_mem = memory(src_md, src_data_ptr, engine); +``` + +### b. Create Convolution Primitive + +* **NCHW** - create `memory::desc` with *any* card for 'input', 'output' and 'weight'; query proposed `memory::desc` from convolution primitive; +* **NHWC** - create `memory::desc` with `format_tag::nhwc` for 'input' and 'output', use *any* for 'weight'; if we use `hwio` for 'weight' convolution primitive will be created with gemm rather jit avx512. + +## CPU Channels Last Targets + +* **User Experience** - No special user level code change, only 'input' and 'model' conversion is required; +* **Scenarios** - cover both training and inference; +* **Models** - ResNet50 and ResNext101, extended targets: torchvision models, detectron2; +* **Performance Targets** - training >0.8x blocked; inference throughput > 0.8x blocked; inference latency? (need further discussion) +* **Operator Coverage** - No less than GPU path; +* **BFloat16** - This part shall align with BFloat16 integration (need further discussion); +* **int8** - Need further discussion. diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/optimizer_fusion.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/optimizer_fusion.md.txt new file mode 100644 index 000000000..6f8f722e2 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/optimizer_fusion.md.txt @@ -0,0 +1,36 @@ +Optimizer Fusion +================ + +## Introduction +As with TorchScript, operation fusion reduces the number of operators that will be executed, and reduces overhead time. This methodology is also applied in ipex optimizer Optimization. We support Lamb/Adagrad/SGD fusion for both FP32/BF16(Split) at current stage. + +Let's use [adagrad update](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html?highlight=adagrad#torch.optim.Adagrad) as an example. 
+
+```python
+    if weight_decay != 0:
+        grad = grad.add(param, alpha=weight_decay)
+    clr = lr / (1 + (step - 1) * lr_decay)
+    state_sum.addcmul_(grad, grad, value=1)
+    std = state_sum.sqrt().add_(eps)
+    param.addcdiv_(grad, std, value=-clr)
+```
+
+## Operation Fusion
+
+One problem of the native implementation above is that we need to access the whole storage of "grad", "parameters", and "state sum" several times. For example, we need to access the whole storage of "parameters" and "grad" in the first statement. For large topologies, it is possible that "grad" and "parameters" cannot fit in the on-board CPU cache. When we access the storage of "grad" again in the third statement, the processor must read the data from memory again instead of from the much faster on-board CPU cache. This is a memory-bound bottleneck that prevents good performance.
+
+Fusion is the methodology to solve this problem. Since the five statements in the pseudo code are all element-wise operations, we can fuse them into a single one, like the pseudo code below.
+
+```python
+    adagrad_fused_step(param, grad, state_sum, ...(other args))
+```
+
+In our fused operators, we split the storage of "grad", "parameters" and "state sum" into several groups and ensure each group is small enough to fit in the cache. The pseudo code below illustrates our execution process.
+
+```python
+    grad = (grad_0, grad_1, ..., grad_n)
+    param = (param_0, param_1, ..., param_n)
+    state_sum = (state_sum_0, state_sum_1, ..., state_sum_n)
+    for i in range(n):
+        adagrad_step(grad_i, param_i, state_sum_i, ...(other_args))
+```
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/runtime_extension.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/runtime_extension.md.txt new file mode 100644 index 000000000..03f0e9f56 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/runtime_extension.md.txt @@ -0,0 +1,174 @@
+Runtime Extension
+=================
+
+Intel® Extension for PyTorch\* Runtime Extension provides a couple of PyTorch frontend APIs for users to get finer-grained control of the thread runtime. It provides:
+
+1. Multi-stream inference via the Python frontend module `ipex.cpu.runtime.MultiStreamModule`.
+2. Spawning asynchronous tasks via the Python frontend module `ipex.cpu.runtime.Task`.
+3. Configuring core bindings for OpenMP threads via the Python frontend `ipex.cpu.runtime.pin`.
+
+**Note**: Intel® Extension for PyTorch\* Runtime Extension is in the **prototype** stage. The API is subject to change. More detailed descriptions are available at the [API Documentation page](../api_doc.rst).
+
+## Requirements
+
+Intel® Extension for PyTorch\* Runtime Extension relies on Intel OpenMP (`libiomp5.so`) to bind threads to cores. If you want to use it in your application, start your model script with an extra flag: `LD_PRELOAD=$LD_PRELOAD:$PATH/libiomp5.so python model_script.py`.
+
+## Use Cases
+
+### Example of MultiStream Module
+
+Runtime Extension supports weight-sharing multi-stream inference for throughput mode on CPU. You need to convert the original model into a multi-stream model and run the new multi-stream model as normal. The detailed description of the parameters used to create a `MultiStreamModule` is available at the [API Documentation page](../api_doc.rst).
+
+`MultiStreamModule` can improve performance for inference in throughput mode. We suggest creating `MultiStreamModule` with `num_streams` of "AUTO", which heuristically decides the number of streams. Usually it provides reasonable performance. 
However, it may not be optimal for some cases (refer to the section [Performance recipes](#performance-recipes) for details). Manual tuning for number of streams is needed. + +The `MultiStreamModule` creates number of streams based on input parameter `num_streams` and bind cores to stream based on input parameter `cpu_pool`. If the number of cores inside `cpu_pool` is divisible by `num_streams`, the cores will be allocated equally to each stream. If the number of cores inside `cpu_pool` is not divisible by `num_streams` with remainder N, one extra core will be allocated to the first N streams. We suggest to set the `num_streams` as divisor of core number inside `cpu_pool`. + +If the inputs' batchsize is larger than and divisible by ``num_streams``, the batchsize will be allocated equally to each stream. If batchsize is not divisible by ``num_streams`` with remainder N, one extra piece will be allocated to the first N streams. If the inputs' batchsize is less than ``num_streams``, only the first batchsize's streams are used with mini batch as one. We suggest to set inputs' batchsize larger than and divisible by ``num_streams``. When creating `MultiStreamModule`, if you leave num of streams as "AUTO", we suggest to set inputs' batchsize larger than and divisible by number of cores. + +Let's create some ExampleNets that will be used by further examples: +``` +import torch +import intel_extension_for_pytorch as ipex + +class ExampleNet1(torch.nn.Module): + def __init__(self): + super(ExampleNet1, self).__init__() + self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False) + + def forward(self, x): + x1 = self.conv(x) + y = torch.flatten(x1, start_dim=1) + return y + +class ExampleNet2(torch.nn.Module): + def __init__(self): + super(ExampleNet2, self).__init__() + self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False) + self.conv2 = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False) + + def forward(self, x1, x2): + y1 = self.conv(x1) + y2 = self.conv2(x2) + y = torch.flatten(y1, start_dim=1) + return y1, y + +model1 = ExampleNet1() +model1.eval() +x = torch.rand(16, 64, 3, 3) + +with torch.no_grad(): + traced_model1 = torch.jit.trace(model1, x) + traced_model1 = torch.jit.freeze(traced_model1) + +model2 = ExampleNet2() +model2.eval() +x2 = torch.rand(16, 64, 3, 3) + +with torch.no_grad(): + traced_model2 = torch.jit.trace(model2, (x, x2)) + traced_model2 = torch.jit.freeze(traced_model2) +``` + +#### Examples1: Basic Usage +Here is the example of a model with single tensor input/output. We create a CPUPool with all the cores available on numa node 0. And creating a `MultiStreamModule` with stream number of 2 to do inference. +``` +# Convert the model into multi_Stream_model +cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0) +multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model1, num_streams=2, cpu_pool=cpu_pool) + +with torch.no_grad(): + y = multi_Stream_model(x) +``` + +#### Examples2: Usage with "AUTO" setting +When creating a `MultiStreamModule`, we have default settings for `num_streams` ("AUTO") and `cpu_pool` (with all the cores available on numa node 0). For the `num_streams` of "AUTO", there are limitations to use with int8 datatype as we mentioned in below performance receipts section. 
```
# Convert the model into multi_Stream_model
multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model1)

with torch.no_grad():
    y = multi_Stream_model(x)
```

#### Example 3: Usage for models with structured inputs/outputs
For modules such as ExampleNet2 with structured input/output tensors, users need to create a `MultiStreamModuleHint` as the input hint and the output hint. `MultiStreamModuleHint` tells `MultiStreamModule` how to automatically split the input into streams and concatenate the output from each stream.
```
# Convert the model into multi_Stream_model
cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
# Create the input hint object
input_hint = ipex.cpu.runtime.MultiStreamModuleHint(0, 0)
# Create the output hint object
# When a Python module has multiple output tensors, they are automatically packed
# into a tuple, so we pass a tuple (0, 0) to create the output_hint
output_hint = ipex.cpu.runtime.MultiStreamModuleHint((0, 0))
multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model2,
                                                        num_streams=2,
                                                        cpu_pool=cpu_pool,
                                                        input_split_hint=input_hint,
                                                        output_concat_hint=output_hint)

with torch.no_grad():
    y = multi_Stream_model(x, x2)
```

#### Performance recipes
There are two motivations to use the `MultiStreamModule`:
1. Better cache locality: with `MultiStreamModule`, the activations are kept within the CPU cores allocated to a stream instead of the whole `cpu_pool`.
2. Reduced OMP sync overhead: if one CPU core is allocated to one stream, the whole execution needs to do an OMP sync only once after all streams finish execution, instead of syncing per layer.

Thus, `MultiStreamModule` may benefit performance for inference in throughput mode. However, the end-to-end performance is impacted by these factors:
1. The kernels' efficiency, which differs with different numbers of OMP threads.
2. The overhead of the inputs' automatic split and the outputs' automatic concatenation for each stream.
3. The overhead of waking up the pthreads (for asynchronous stream execution) and of thread synchronization after stream execution.

Here are some performance recipes that we recommend for better multi-stream performance.

* When creating `MultiStreamModule` with a `torch.nn.Module` as the imperative-path module, each stream inside `MultiStreamModule` suffers from the GIL when doing inference together. This hurts end-to-end performance. We recommend creating `MultiStreamModule` with a `torch.jit.ScriptModule`.

* For convolution networks, `intel_extension_for_pytorch` has a quick path for getting the convolution primitive that mitigates overhead when `OMP_NUM_THREADS` is the same between the `torch.jit.trace` and model execution phases. To use this quick path for better performance, we recommend setting the `OMP_NUM_THREADS` environment variable before launching the model script. The recommended value of `OMP_NUM_THREADS` equals the number of threads used by each stream. For example, when creating `MultiStreamModule` with stream number `s1` and a CPUPool with core number `c1`, each stream is allocated `c1/s1` threads; we recommend setting `OMP_NUM_THREADS` to this value.

* `numactl` and the thread management in `MultiStreamModule` work at different levels. `MultiStreamModule` sets the thread affinity for each stream, which works at the thread level. However, Python modules outside the streams, such as the dataloader, are out of view of `MultiStreamModule`. As a result, we recommend using `numactl -C core_ids -m node_id` for process-level core and memory resource management.
For the core resources set by `numactl`, use the same set or a superset of the cores used to create the `CPUPool`; otherwise, the behavior is undefined in the current implementation.

#### Known issues
* The Intel® Extension for PyTorch\* runtime extension feature with the int8 data type does not support dynamic shapes well. To avoid performance issues, we recommend doing `jit.trace` with the same mini batch size used by each stream. For example, when creating `MultiStreamModule` with stream number `s1` and an input global batch size of `gb`, each stream runs inference with a mini batch size of `gb/s1`; we should use this mini batch size value to do `jit.trace`. To be aware of the `num_streams` value, we recommend creating `MultiStreamModule` with `num_streams` set explicitly instead of "AUTO". Due to the same limitation, the behavior of streams running int8 inference with different mini batch sizes is undefined and not supported.

### Example of asynchronous tasks

Here is an example of using asynchronous tasks. With the support of the runtime API, you can run 2 modules simultaneously. Each module runs on the corresponding CPU pool.

```
cpu_pool1 = ipex.cpu.runtime.CPUPool([0, 1, 2, 3])
cpu_pool2 = ipex.cpu.runtime.CPUPool([4, 5, 6, 7])

task1 = ipex.cpu.runtime.Task(traced_model1, cpu_pool1)
task2 = ipex.cpu.runtime.Task(traced_model1, cpu_pool2)

y1_future = task1(x)
y2_future = task2(x)

y1 = y1_future.get()
y2 = y2_future.get()
```

### Example of configuring core binding

Runtime Extension provides the `ipex.cpu.runtime.pin` API to bind a CPU pool of physical cores. It can be used without the async task feature. Here is an example of using `ipex.cpu.runtime.pin` in a `with` context.

```
cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
with ipex.cpu.runtime.pin(cpu_pool):
    y_runtime = traced_model1(x)
```

## Detail Design

### How the core binding is implemented

The Runtime Extension relies on the `kmp_*` APIs inside the `iomp` shared library to fulfill the core binding. During the initialization of async threads, `kmp_*` API functions are invoked internally to start up an OpenMP group with the specified number of worker threads. Each worker thread is then bound to the designated physical core(s) inside this OpenMP group. After initialization, when you submit a task, the OpenMP group serves the requested task.

### Design of Task

A Task is an abstraction of computation based on a PyTorch module and is scheduled asynchronously. When a task is created with a specific `nn.Module` or JIT module, a sub-thread is initialized and bound to this task. During the initialization, an OpenMP worker group is created and bound to this sub-thread. After initialization, the sub-thread waits for input. When the main thread submits an input to this task, the sub-thread wakes up and executes the input. The main thread returns a `FutureTensor` and does not block until an explicit `FutureTensor.get()` is invoked to get the results executed in the sub-thread.

### IOMP preload or load during the runtime

Since Runtime Extension relies on the APIs from IOMP, we need to preload IOMP before executing the application. At the same time, Intel® Extension for PyTorch\* built with the Runtime API enabled should still work fine without loading IOMP if the user does not use the runtime API. Therefore, we choose to `dlopen` the IOMP library during runtime, and we ensure the IOMP symbols are initialized once, globally.
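Putting the recommendations above together, here is a minimal launch sketch. It assumes a 2-stream `MultiStreamModule` running on one socket with cores 0-55; the core IDs, NUMA node, per-stream thread count, library path, and the `run_multi_stream.py` script name are all illustrative.

```
# Preload Intel OpenMP as required by the Runtime Extension, pin the process to
# node 0 cores and memory with numactl, and set OMP_NUM_THREADS to the per-stream
# thread count (56 cores / 2 streams = 28).
LD_PRELOAD=$LD_PRELOAD:$PATH_TO_IOMP/libiomp5.so \
OMP_NUM_THREADS=28 \
numactl -C 0-55 -m 0 python run_multi_stream.py
```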
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/split_sgd.rst.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/split_sgd.rst.txt new file mode 100644 index 000000000..e9576611b --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/split_sgd.rst.txt @@ -0,0 +1,91 @@
Split SGD
=========

Both inference and training workloads are within Intel's optimization scope, and optimizations for training optimizer functions are an important part of this. The optimizations use a mechanism called **Split SGD** and take advantage of the BFloat16 data type and operator fusion. The **adagrad**, **lamb** and **sgd** optimizers are supported.

BFloat16
--------

The figure below shows the definition of the Float32 (top) and `BFloat16 `_ (bottom) data types. Compared to Float32, BFloat16 is only half as long, and thus saves half the memory. It is supported natively at the instruction set level to boost deep learning workloads starting from the 3rd Generation of Intel® Xeon® Scalable Processors. It is compatible with Float32 since both have the same bit length for the "sign" and "exponent" parts. BFloat16 only has a 7-bit "mantissa" part while Float32 has 23 bits. BFloat16 has the same capacity to represent "digit ranges" as Float32, but has a shorter "precision" part.

.. image:: https://user-images.githubusercontent.com/33838455/86600181-00f5c200-bfa0-11ea-93f0-95af3f0bff08.png
  :width: 1200
  :align: center
  :alt: Data types

An advantage of BFloat16 is that it saves memory and reduces the computation workload, but the fewer mantissa bits bring negative effects as well. Let's use an "ADD" operation as an example to explain the disadvantage. To perform addition of 2 floating point numbers, we need to shift the mantissa part of the numbers left or right to align their exponent parts. Since BFloat16 has a shorter mantissa part, it is much easier than Float32 to lose the mantissa part after the shifting, which causes accuracy loss.

Let's use the following two decimal numbers **x** and **y** as an example. We first do the calculation in a high precision data type (10 valid digits after the decimal point).

.. math::

   x &= 0.1234500000*10^{10} \\
   y &= 0.1234500000*10^{5} \\
   x+y &= 0.1234500000*10^{10} + 0.1234500000*10^{5} \\
   &= 0.1234500000*10^{10} + 0.0000012345*10^{10} \\
   &= 0.1234512345*10^{10}

This makes sense because after shifting **y** right by 5 digits, its fraction part is still there.

Let's do the calculation using a low precision data type (5 valid digits after the decimal point):

.. math::

   x &= 0.12345*10^{10} \\
   y &= 0.12345*10^{5} \\
   x+y &= 0.12345*10^{10} + 0.12345*10^{5} \\
   &= 0.12345*10^{10} + 0.00000*10^{10} \\
   &= 0.12345*10^{10}

Since the data type has only 5 digits for the fraction part, after shifting **y** by 5 digits its fraction part is fully removed. This causes significant accuracy loss and, by its nature, is a drawback of lower-precision data types.

Stochastic Gradient Descent (SGD)
---------------------------------

Basically, training involves 3 steps:

1. Forward propagation: Perform inference once and compare the results with the ground truth to get the loss value.
2. Backward propagation: Utilize the chain rule to calculate gradients of the parameters based on the loss value.
3. Parameter update: Update the parameter values using the gradients.
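As a rough illustration, these three steps map onto a standard PyTorch training loop as in the sketch below (the model, data, and hyperparameters are placeholders, not part of the original text):

.. code-block:: python

   import torch

   model = torch.nn.Linear(16, 4)                             # placeholder model
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
   loss_fn = torch.nn.CrossEntropyLoss()
   data = torch.randn(8, 16)                                  # placeholder batch
   target = torch.randint(0, 4, (8,))

   for _ in range(10):
       output = model(data)                                   # 1. forward propagation
       loss = loss_fn(output, target)
       optimizer.zero_grad()
       loss.backward()                                        # 2. backward propagation
       optimizer.step()                                       # 3. parameter update (SGD)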
The training is actually a loop over these 3 steps, in sequence, until the loss meets the requirement or a predefined training budget is exhausted. Stochastic Gradient Descent (SGD) is the most widely used method for the 3rd step to update the parameter values. To make it easy to understand, the 3rd step can be described by the following formula:

.. math::

   W = W - α * gW

where :math:`W` denotes the parameters to be updated, :math:`gW` denotes the gradient received during backward propagation, and :math:`α` denotes the learning rate.

Split SGD
---------

Since the update applied in SGD is repeated many times, because of the low precision data loss mentioned earlier, if both :math:`W` and :math:`gW` are stored in the BFloat16 data type, we will most likely lose valid bits and make the training results inaccurate. Using FP32 master parameters is a common practice for avoiding round-off errors at the parameter update step.
To keep FP32 master parameters, we have 3 design choices:

1. Only save FP32 parameters: For this choice, we need to introduce an additional FP32->BF16 cast at each iteration to get the benefit of BF16 at the forward and backward propagation steps.
2. Save both FP32 and BF16 parameters: The BF16 parameters are used at the forward and backward propagation steps, and the FP32 master parameters are used at the update step. This choice increases the memory footprint.
3. "Split" choice: In order to get the performance benefits of BFloat16 at the forward and backward propagation steps, while avoiding an increased memory footprint, we propose the mechanism **"Split SGD"**.

The idea is to "split" a 32-bit floating point number into 2 parts:

1. Top half: the first 16 bits can be viewed as exactly a BFloat16 number.
2. Bottom half: the last 16 bits are still kept to avoid accuracy loss.

FP32 parameters are split into a "Top half" and a "Bottom half". When performing forward and backward propagations, the Top halves are used to take advantage of Intel BFloat16 support. When performing the parameter update with SGD, we concatenate the Top half and the Bottom half to recover the parameters back to FP32 and then perform regular SGD operations.

It is a common practice to use FP32 master parameters in order to avoid round-off errors with BF16 parameter updates. **SplitSGD** is an optimization that stores the FP32 master parameters with a reduced memory footprint.

.. image:: ../../../images/split_sgd/split_sgd.png
  :width: 800
  :align: center
  :alt: Split SGD

|

The following pseudo code illustrates the process of Split SGD.

.. code-block:: python

   fp32_w = concat_fp32_from_bf16(bf16_w, trail)
   fp32_gw = bf16_gw.float()
   fp32_w -= α * fp32_gw  # SGD step without weight_decay or momentum
   bf16_w, trail = split_bf16_from_fp32(fp32_w)
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/features/sq_recipe_tuning_api.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/features/sq_recipe_tuning_api.md.txt new file mode 100644 index 000000000..1ebabfaaf --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/features/sq_recipe_tuning_api.md.txt @@ -0,0 +1,21 @@
Smooth Quant Recipe Tuning API (Prototype)
=============================================

Smooth Quantization is a popular method to improve the accuracy of int8 quantization. The [autotune API](../api_doc.html#ipex.quantization.autotune) allows automatic global alpha tuning, as well as automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor, for the best INT8 accuracy.

SmoothQuant introduces an alpha parameter to calculate the ratio of input and weight updates in order to reduce quantization error.
The SmoothQuant arguments are as follows:

| Arguments | Default Value | Available Values | Comments |
|:----------------:|:-------------:|:---------------------:|:-----------------------------------------------------------:|
| alpha | 'auto' | [0-1] / 'auto' | value to balance input and weight quantization error. |
| init_alpha | 0.5 | [0-1] / 'auto' | value to get baseline quantization error for auto-tuning. |
| alpha_min | 0.0 | [0-1] | min value of auto-tuning alpha search space |
| alpha_max | 1.0 | [0-1] | max value of auto-tuning alpha search space |
| alpha_step | 0.1 | [0-1] | step size of auto-tuning alpha search space |
| shared_criterion | "mean" | ["min", "mean", "max"] | criterion for the input LayerNorm op of a transformer block. |
| enable_blockwise_loss | False | [True, False] | whether to enable block-wise auto-tuning |

Please refer to the [LLM examples](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference) for complete examples.

**Note**: When defining dataloaders for calibration, please follow INC's dataloader [format](https://github.com/intel/neural-compressor/blob/master/docs/source/dataloader.md).
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/getting_started.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/getting_started.md.txt new file mode 100644 index 000000000..67874f6d4 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/getting_started.md.txt @@ -0,0 +1,160 @@
# Quick Start

The following instructions assume you have installed the Intel® Extension for PyTorch\*. For installation instructions, refer to [Installation](../../../index.html#installation?platform=cpu&version=main).

To start using the Intel® Extension for PyTorch\* in your code, you need to make the following changes:

1. Import the extension with `import intel_extension_for_pytorch as ipex`.
2. Invoke the `optimize()` function to apply optimizations.
3. Convert the eager mode model to a graph mode model.
   - For TorchScript, invoke `torch.jit.trace()` and `torch.jit.freeze()`.
   - For TorchDynamo, invoke `torch.compile(model, backend="ipex")` (*Beta feature*).

**Important:** It is highly recommended to `import intel_extension_for_pytorch` right after `import torch`, prior to importing other packages.

The example below demonstrates how to use the Intel® Extension for PyTorch\* with TorchScript:

```python
import torch
############## import ipex ###############
import intel_extension_for_pytorch as ipex
##########################################

model = Model()
model.eval()
data = ...

############## TorchScript ###############
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, data)
    model = torch.jit.freeze(model)
    model(data)
##########################################
```

The example below demonstrates how to use the Intel® Extension for PyTorch\* with TorchDynamo:

```python
import torch
############## import ipex ###############
import intel_extension_for_pytorch as ipex
##########################################

model = Model()
model.eval()
data = ...

############## TorchDynamo ###############
model = ipex.optimize(model, weights_prepack=False)

model = torch.compile(model, backend="ipex")
with torch.no_grad():
    model(data)
##########################################
```

More examples, including training and usage of low precision data types, are available in the [Examples](./examples.md) section.
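For training, `ipex.optimize` can also take the optimizer and return an optimized `(model, optimizer)` pair. Below is a minimal BF16 training-step sketch; the tiny model, synthetic batch, and hyperparameters are placeholders and not part of the official examples.

```python
import torch
import intel_extension_for_pytorch as ipex

# A tiny stand-in model and synthetic batch, just to keep the sketch runnable
model = torch.nn.Linear(64, 10)
model.train()
data = torch.randn(32, 64)
target = torch.randint(0, 10, (32,))
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# When an optimizer is passed, ipex.optimize returns the optimized model and optimizer
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

optimizer.zero_grad()
with torch.cpu.amp.autocast():
    output = model(data)
    loss = criterion(output, target)
loss.backward()
optimizer.step()
```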
+ +In [Cheat Sheet](./cheat_sheet.md), you can find more commands that can help you start using the Intel® Extension for PyTorch\*. + + +## LLM Quick Start + +`ipex.llm.optimize` is used for Large Language Models (LLM). + + +```python +import torch +#################### code changes #################### +import intel_extension_for_pytorch as ipex +###################################################### +import argparse +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, +) + +# args +parser = argparse.ArgumentParser("Generation script (fp32/bf16 path)", add_help=False) +parser.add_argument( + "--dtype", + type=str, + choices=["float32", "bfloat16"], + default="float32", + help="choose the weight dtype and whether to enable auto mixed precision or not", +) +parser.add_argument( + "--max-new-tokens", default=32, type=int, help="output max new tokens" +) +parser.add_argument( + "--prompt", default="What are we having for dinner?", type=str, help="input prompt" +) +parser.add_argument("--greedy", action="store_true") +parser.add_argument("--batch-size", default=1, type=int, help="batch size") +args = parser.parse_args() +print(args) + +# dtype +amp_enabled = True if args.dtype != "float32" else False +amp_dtype = getattr(torch, args.dtype) + +# load model +model_id = MODEL_ID +config = AutoConfig.from_pretrained( + model_id, torchscript=True, trust_remote_code=True +) +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=amp_dtype, + config=config, + low_cpu_mem_usage=True, + trust_remote_code=True, +) +tokenizer = AutoTokenizer.from_pretrained( + model_id, + trust_remote_code=True +) +model = model.eval() +model = model.to(memory_format=torch.channels_last) + +# Intel(R) Extension for PyTorch* +#################### code changes #################### # noqa F401 +model = ipex.llm.optimize( + model, + dtype=amp_dtype, + inplace=True, + deployment_mode=True, +) +###################################################### # noqa F401 + +# generate args +num_beams = 1 if args.greedy else 4 +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams) + +# input prompt +prompt = args.prompt +input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1) +print("---- Prompt size:", input_size) +prompt = [prompt] * args.batch_size + +# inference +with torch.inference_mode(), torch.cpu.amp.autocast(enabled=amp_enabled): + input_ids = tokenizer(prompt, return_tensors="pt").input_ids + gen_ids = model.generate( + input_ids, + max_new_tokens=args.max_new_tokens, + **generate_kwargs + ) + gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) + input_tokens_lengths = [x.shape[0] for x in input_ids] + output_tokens_lengths = [x.shape[0] for x in gen_ids] + total_new_tokens = [ + o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths) + ] + print(gen_text, total_new_tokens, flush=True) +``` + +More LLM examples, including usage of low precision data types are available in the [LLM Examples](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/llm) section. 
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/installation.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/installation.md.txt new file mode 100644 index 000000000..707a091db --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/installation.md.txt @@ -0,0 +1,8 @@ +Installation +============ + +Select your preferences and follow the installation instructions provided on the [Installation page](../../../index.html#installation?platform=cpu&version=v2.4.0%2Bcpu). + +After successful installation, refer to the [Quick Start](getting_started.md) and [Examples](examples.md) sections to start using the extension in your code. + +**NOTE:** For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the [LLM best practices](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm). diff --git a/cpu/2.4.0+cpu/_sources/tutorials/introduction.rst.txt b/cpu/2.4.0+cpu/_sources/tutorials/introduction.rst.txt new file mode 100644 index 000000000..8037db666 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/introduction.rst.txt @@ -0,0 +1,25 @@ +Introduction +============ + +Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. +Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs. + +.. note:: + + The package name used when you import Intel® Extension for PyTorch\* changed + from ``intel_pytorch_extension`` (for versions 1.2.0 through 1.9.0) to + ``intel_extension_for_pytorch`` (for versions 1.10.0 and later). Use the + correct package name depending on the version you are using. + +For the detailed list of supported features and usage instructions, refer to `Features `_. For overview of Large Language Models (LLM) optimizations and usage instructions, refer to +the `Large Language Models (LLM) `_ section. + +Get Started +----------- +- `Installation <../../../index.html#installation?platform=cpu&version=v2.4.0%2Bcpu>`_ +- `Quick Start `_ +- `Examples `_ + +API Documentation +----------------- +For detailed description of the Intel® Extension for PyTorch* APIs, refer to the `API Documentation `_ section. diff --git a/cpu/2.4.0+cpu/_sources/tutorials/known_issues.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/known_issues.md.txt new file mode 100644 index 000000000..0aff2be20 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/known_issues.md.txt @@ -0,0 +1,124 @@ +Troubleshooting +=============== + +## General Usage + +- **Problem**: Issues with the `+cpu` PyTorch package. + - **Cause**: Certain Python packages may have PyTorch as a hard dependency. If you installed the `+cpu` version of PyTorch, installation of these packages might replace the `+cpu` version with the default version released on Pypi.org. + - **Solution**: Reinstall the `+cpu` version back. +- **Problem**: The workload running with Intel® Extension for PyTorch\* occupies a remarkably large amount of memory. + - **Solution**: Try to reduce the occupied memory size by setting the `weights_prepack` parameter of the `ipex.optimize()` function to `False`. 
- **Problem**: The `conv+bn` folding feature of the `ipex.optimize()` function does not work if inference is done with a custom function:

  ```
  import torch
  import intel_extension_for_pytorch as ipex

  class Module(torch.nn.Module):
      def __init__(self):
          super(Module, self).__init__()
          self.conv = torch.nn.Conv2d(1, 10, 5, 1)
          self.bn = torch.nn.BatchNorm2d(10)
          self.relu = torch.nn.ReLU()

      def forward(self, x):
          x = self.conv(x)
          x = self.bn(x)
          x = self.relu(x)
          return x

      def inference(self, x):
          return self.forward(x)

  if __name__ == '__main__':
      m = Module()
      m.eval()
      m = ipex.optimize(m, dtype=torch.float32, level="O0")
      d = torch.rand(1, 1, 112, 112)
      with torch.no_grad():
          m.inference(d)
  ```

  - **Cause**: PyTorch FX limitation.
  - **Solution**: You can avoid this error by calling `m = ipex.optimize(m, level="O0")`, which doesn't apply ipex optimizations, or disable `conv+bn` folding by calling `m = ipex.optimize(m, level="O1", conv_bn_folding=False)`.

## Performance Regression

- Some models may experience performance regression compared to 2.0.x due to the deprecation of the NNC feature in PyTorch\*.

## TorchDynamo

- **Problem**: A workload that uses `torch.compile()` fails to run or demonstrates poor performance.
  - **Cause**: The support of `torch.compile()` with `ipex` as the backend is still a beta feature. Currently, the following HuggingFace models fail to run using `torch.compile()` with the `ipex` backend due to memory issues:
    - masked-language-modeling+xlm-roberta-base
    - causal-language-modeling+gpt2
    - causal-language-modeling+xlm-roberta-base
    - summarization+t5-base
    - text-classification+allenai-longformer-base-409
  - **Solution**: Use the `torch.jit` APIs and graph optimization APIs of the Intel® Extension for PyTorch\*.

## Dynamic Shape

- **Problem**: When running inference of an NLP model with dynamic input data lengths using TorchScript (either `torch.jit.trace` or `torch.jit.script`), performance with Intel® Extension for PyTorch\* may be less than that without Intel® Extension for PyTorch\*.
  - **Solution**: Use the workaround below:

    - Python interface
    ```python
    torch._C._jit_set_texpr_fuser_enabled(False)
    ```
    - C++ interface
    ```c++
    #include
    torch::jit::setTensorExprFuserEnabled(false);
    ```

## INT8

- **Problem**: Limitations of dynamic shape support in static quantization:
  - When an input shape is provided at runtime for the first time, execution could take a longer time to compile a new kernel for this shape; specifically, the kernel compilation time could be long for complicated kernels.
  - Channels Last format won't take effect with dynamic input shapes for CNN models at this time. Optimizations are ongoing.
- **Problem**: `RuntimeError: Overflow when unpacking long` when a tensor's min/max value exceeds the int range while performing int8 calibration.
  - **Solution**: Customize `QConfig` to use the min-max calibration method.
- **Problem**: Models get a large accuracy loss with the default quantization recipe.
  - **Solution**: Try using [the INT8 Recipe Tuning API](./features/int8_recipe_tuning_api.md) to tune a recipe with a satisfactory accuracy loss.
- **Problem**: Incorrect results with large tensors when calibrating with `quantize_per_tensor`, when benchmarking with 1 OpenMP\* thread (find more detailed info [here](https://github.com/pytorch/pytorch/issues/80501)).
  - **Solution**: Editing your code following the pseudocode below can work around this issue, if you do need to explicitly set `OMP_NUM_THREADS=1` for benchmarking. However, there could be a performance regression if the oneDNN graph compiler prototype feature is used.

    Workaround pseudocode:
    ```
    # perform convert/trace/freeze with omp_num_threads > 1 (N)
    torch.set_num_threads(N)
    prepared_model = prepare(model, input)
    converted_model = convert(prepared_model)
    traced_model = torch.jit.trace(converted_model, input)
    freezed_model = torch.jit.freeze(traced_model)
    # run the frozen model once to apply the optimization pass
    freezed_model(input)

    # benchmarking with omp_num_threads = 1
    torch.set_num_threads(1)
    run_benchmark(freezed_model, input)
    ```
- For models with dynamic control flow, please try dynamic quantization. Users are likely to get performance gains for GEMM models.
- Support for `EmbeddingBag` with INT8 when bag size > 1 is work in progress.

## BFloat16

- **Problem**: BF16 AMP (auto-mixed-precision) runs abnormally with the extension on AVX2-only machines if the topology contains `Conv`, `Matmul`, `Linear`, and `BatchNormalization`.
  - **Solution**: TBD

- **Problem**: A PyTorch\* model containing a `torch.nn.TransformerEncoderLayer` component may encounter a RuntimeError during BF16 training or inference if the model is optimized by `ipex.optimize()` with arguments set to default values.
  - **Solution**: `TransformerEncoderLayer` optimized by `ipex.optimize()` with the weight prepacking functionality enabled may encounter a weight dimension issue. The error can be avoided by disabling weight prepacking: `model = ipex.optimize(model, weights_prepack=False)`.

## Runtime Extension

The following limitations currently exist:

- The runtime extension `MultiStreamModule` does not support DLRM inference, since the input of DLRM (EmbeddingBag specifically) cannot simply be batch split.
- The runtime extension `MultiStreamModule` has poor RNNT inference performance compared with native throughput mode. Only part of the RNNT model (`joint_net` specifically) can be jit traced into a graph. However, in one batch inference, `joint_net` is invoked multiple times. This increases the overhead of `MultiStreamModule` due to input batch splitting, thread synchronization, and output concatenation.

## Result Correctness

- **Problem**: Incorrect Conv and Linear results if the number of OMP threads is changed at runtime.
  - **Cause**: The oneDNN memory layout depends on the number of OMP threads, which requires the caller to detect changes in the number of OMP threads; this detection has not been implemented yet in this release.
\ No newline at end of file
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/license.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/license.md.txt new file mode 100644 index 000000000..de2fc8838 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/license.md.txt @@ -0,0 +1,9 @@
License
=======

Intel® Extension for PyTorch\* is licensed under the [Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0). This software includes components that have separate copyright notices and licensing terms. Your use of the source code for these components is subject to the terms and conditions of the following licenses.
Apache License Version 2.0:

[Intel® Extension for PyTorch\* LICENSE](https://github.com/intel/intel-extension-for-pytorch/blob/main/LICENSE.txt)

diff --git a/cpu/2.4.0+cpu/_sources/tutorials/llm.rst.txt b/cpu/2.4.0+cpu/_sources/tutorials/llm.rst.txt new file mode 100644 index 000000000..3c2878a72 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/llm.rst.txt @@ -0,0 +1,152 @@
Large Language Models (LLM) Optimization Overview
==================================================

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs are GPT-like architectures that consist of multiple Decoder layers.
The MultiHeadAttention and FeedForward layers are two key components of every Decoder layer. The generation task is memory bound because iterative decoding and the kv_cache require special management to reduce memory overheads. Intel® Extension for PyTorch* provides many optimizations specific to these LLMs.
On the operator level, the extension provides highly efficient GEMM kernels to speed up Linear layers, as well as customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant and weight-only quantization, are also enabled. Besides, tensor parallelism can also be adopted to get lower latency for LLMs.

These LLM-specific optimizations can be automatically applied with a single frontend API function in the Python interface, `ipex.llm.optimize()`. Check `llm.optimize <./llm/llm_optimize.md>`_ for more details.

.. toctree::
   :hidden:
   :maxdepth: 1

   llm/llm_optimize

`ipex.llm` Optimized Model List for Inference
--------------------------------------------------

Verified for single instance mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. raw:: html
   :file: ../_static/htmls/tbl_single.html

Verified for distributed inference mode via DeepSpeed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. raw:: html
   :file: ../_static/htmls/tbl_deepspeed.html

*Note*: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). Work is in progress to better support the models in the tables with various data types. In addition, more models will be optimized in the future.

Please check `LLM best known practice `_ for instructions to install/setup the environment and for example scripts.

Module Level Optimization API for customized LLM (Prototype)
------------------------------------------------------------

In the past year, LLMs have been flourishing, with many open-source models contributed to the community, while researchers are building their own LLMs from transformer blocks with variants in implementation details. To help LLM researchers and developers improve their productivity, Intel® Extension for PyTorch* provides module level optimizations for commonly used LLM modules and functionalities, which are operators or certain operator combinations in nature.

Please check `LLM module level optimization practice `_ to better understand how to use `module level APIs `_ to optimize your LLM and achieve better performance.
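Before looking at the demos, here is a minimal, illustrative sketch of the `ipex.llm.optimize()` frontend mentioned above; the model id is a placeholder, and complete recipes are in the linked LLM examples and the `llm.optimize <./llm/llm_optimize.md>`_ page.

.. code-block:: python

   import torch
   import intel_extension_for_pytorch as ipex
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id
   model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   # Apply the LLM-specific optimizations described in this document
   model = ipex.llm.optimize(model, dtype=torch.bfloat16)

   with torch.inference_mode(), torch.cpu.amp.autocast(enabled=True):
       input_ids = tokenizer("What are we having for dinner?", return_tensors="pt").input_ids
       generated = model.generate(input_ids, max_new_tokens=32)
   print(tokenizer.batch_decode(generated, skip_special_tokens=True))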
Demos
-----

Intel® Extension for PyTorch* LLM optimizations can be integrated into a typical LLM Q&A web service.

.. list-table::

   * - .. image:: ../../images/llm/GenAI-bf16.gif
          :width: 500
          :alt: UI with BF16

     - .. image:: ../../images/llm/GenAI-int8.gif
          :width: 500
          :alt: UI with INT8

The following figures show demos of the Llama 2 and GPT-J models with single-instance inference and with distributed inference via DeepSpeed, using lower precision data types.

.. list-table::

   * - .. figure:: ../../images/llm/bf16_llama.gif
          :width: 300
          :alt: Llama 2 with BF16

          a

     - .. figure:: ../../images/llm/smoothquant_int8_llama.gif
          :width: 300
          :alt: Llama 2 with INT8 Quantization with SmoothQuant

          b

     - .. figure:: ../../images/llm/woq_int8_llama.gif
          :width: 300
          :alt: Weight Only Quantization with INT8 for Llama 2

          c

   * - .. figure:: ../../images/llm/woq_int4_gptj.gif
          :width: 300
          :alt: Weight Only Quantization with INT4 for GPT-J

          d

     - .. figure:: ../../images/llm/autotp_bf16_llama.gif
          :width: 300
          :alt: Distributed Inference with DeepSpeed with BF16 on Llama 2 with AutoTP feature

          e

     - .. figure:: ../../images/llm/autotp_woq_int8_llama.gif
          :width: 300
          :alt: Distributed Inference with DeepSpeed with Weight Only Quantization INT8 on Llama 2 with AutoTP feature

          f

Figure Legends:

a. Llama 2 model with BF16
b. Llama 2 model with INT8 Quantization with the SmoothQuant technique
c. Llama 2 model with INT8 Weight Only Quantization
d. GPT-J model with INT4 Weight Only Quantization
e. Llama 2 model Distributed Inference with DeepSpeed with the AutoTP feature on BF16
f. Llama 2 model Distributed Inference with DeepSpeed with the AutoTP feature with Weight Only Quantization INT8

Optimization Methodologies
--------------------------

The section below provides a brief introduction to LLM optimization methodologies:

Linear Operator Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Linear operator is the most obvious hotspot in LLM inference. Intel® Extension for PyTorch* provides dedicated optimizations to speed up linear GEMM kernels, through oneDNN, customized linear kernels for weight only quantization, and some other specific tuning. All of them use a specific block format to utilize hardware resources in a highly efficient way.

Low Precision Data Types
~~~~~~~~~~~~~~~~~~~~~~~~

While Generative AI (GenAI) workloads and models are getting more and more popular, the LLMs used in these workloads have more and more parameters. The increasing size of LLMs enhances workload accuracies; however, it also leads to significantly heavier computations and places higher requirements on the underlying hardware. Given that, quantization becomes a more important methodology for inference workloads.

Quantization to shorter data types naturally improves memory I/O throughput and reduces the amount of computation on the CPU. Moreover, shorter data types make it possible to keep more data in the CPU cache, thus reducing memory access occurrences. Compared to cache access, memory access is much more time consuming. Specifically, from the computation perspective, the AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction level acceleration for INT8 computations.
In addition to the mixed-precision and native INT8 quantization solutions, e.g., post-training static quantization and dynamic quantization in PyTorch, `SmoothQuant `_ and weight-only quantization (both INT8 and INT4 weights are supported) are also enabled in Intel® Extension for PyTorch* to get better accuracy and performance compared with the native solutions.

Intel® Extension for PyTorch* speeds up INT8 computations by leveraging oneDNN and oneDNN graph as the backend. Intel® Extension for PyTorch* static quantization provides a default recipe to automatically decide which operators to quantize. Its backend, oneDNN graph, brings matrix-multiplication-based fusions for commonly seen operator patterns and other common fusions like quantization + data type casting. These fusions help achieve the best computation cache locality and efficiency, and thus reduce INT8 quantization overhead significantly.

Intel® Extension for PyTorch* also delivers INT4 optimizations via 4-bit weight-only quantization (WOQ). As the name indicates, WOQ quantizes only the weights to 4-bit integers to further improve computation efficiency via saved memory bandwidth utilization. This technique reduces text generation latency, especially from the second token. AMX INT8 instructions and fusions are also applied for these performant computations.

Indirect Access KV Cache
~~~~~~~~~~~~~~~~~~~~~~~~

The kv_cache is used to reduce computation for the decoder layer, but it also brings memory overheads. For example, when we use beam search, the kv_cache should be reordered according to the latest beam idx, and the current key/value should also be concatenated with the kv_cache in the attention layer to get the entire context to do the scaled dot product. When the sequence is very long, the memory overheads caused by reorder_cache and concat become a performance bottleneck. Indirect Access KV_cache (IAKV) is provided to reduce these overheads. Firstly, IAKV pre-allocates buffers (key and value use different buffers) to store all key/value hidden states and beam index information; the data format is shown in the following left figure (beam_width=4 in this case), and the token state of the key (value) at every timestamp is stored in this pre-allocated buffer. Secondly, we can use the beam index history, shown in the following right figure, to decide which beam should be used at a timestamp; this information generates an offset to access the kv_cache buffer, which means that the reorder_cache and concat overheads are eliminated.


.. image:: ../../images/llm/llm_iakv_1.png
  :width: 400
  :alt: The key/value cache data format


.. image:: ../../images/llm/llm_iakv_2.png
  :width: 400
  :alt: The beam idx trace for every step

Graph Optimization
~~~~~~~~~~~~~~~~~~

Operator fusion is generally used to enable sub-graph fusion to reduce the memory footprint. In addition to linear post-op fusion, e.g., linear + activation function, many customized operators are also provided in Intel® Extension for PyTorch* for further performance improvement. For example, Rotary Position Embedding (ROPE) and Root Mean Square Layer Normalization (RMSNorm).

Distributed Inference
~~~~~~~~~~~~~~~~~~~~~

All the above optimizations already help you get very good performance with a single instance. To further reduce the inference latency and improve throughput, tensor parallelism is also enabled in our solution.
You can first use DeepSpeed to automatically shard the model and then apply the above optimizations with the frontend API function provided by Intel® Extension for PyTorch*.
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/llm/llm_optimize.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/llm/llm_optimize.md.txt new file mode 100644 index 000000000..96e6c4bd8 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/llm/llm_optimize.md.txt @@ -0,0 +1,136 @@
LLM Optimizations Frontend API
======================================

The new API function, `ipex.llm.optimize`, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs).
It provides both model-wise and content-generation-wise optimizations.
You just need to invoke the `ipex.llm.optimize` function instead of the `ipex.optimize` function to apply all optimizations transparently.

This API currently supports inference workloads of certain models.
API documentation is available at the [API Docs page](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/api_doc.html#ipex.llm.optimize),
and the supported model list can be found at [this page](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html#ipexllm-optimized-model-list-for-inference).

For LLM fine-tuning, please check the [LLM fine-tuning tutorial](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/fine-tuning).

## Pseudocode of Common Usage Scenarios

The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch\* APIs to work with LLM models.
Complete examples can be found at [the Example directory](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference).

### FP32/BF16

``` python
import torch
import intel_extension_for_pytorch as ipex
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

dtype = torch.float # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

# inference with model.generate()
...
```

### SmoothQuant

Supports INT8.

``` python
import torch
#################### code changes #################### # noqa F401
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare
###################################################### # noqa F401
import transformers
from torch.utils.data import DataLoader
# load model
model = transformers.AutoModelForCausalLM.from_pretrained(...).eval()
#################### code changes #################### # noqa F401
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
# stage 1: calibration
# prepare your calibration dataset samples
calib_dataset = DataLoader(your_calibration_dataset)
example_inputs = ...
# get one sample input from the calibration dataset
calibration_model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
)
prepared_model = prepare(
    calibration_model.eval(), qconfig, example_inputs=example_inputs
)
with torch.no_grad():
    for calib_samples in calib_dataset:
        prepared_model(calib_samples)
prepared_model.save_qconf_summary(qconf_summary=qconfig_summary_file_path)

# stage 2: quantization
model = ipex.llm.optimize(
    model.eval(),
    quantization_config=qconfig,
    qconfig_summary_file=qconfig_summary_file_path,
)
###################################################### # noqa F401

# generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})
```

### Weight Only Quantization (WOQ)

Supports INT8 and INT4.

``` python
import torch
import intel_extension_for_pytorch as ipex
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT8, # or INT4/NF4
    lowp_mode=ipex.quantization.WoqLowpMode.NONE, # or FP16, BF16, INT8
)

checkpoint = None # optionally load an int4 or int8 checkpoint
model = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

# inference with model.generate()
...
```

### Distributed Inference with DeepSpeed

Distributed inference can be performed with `DeepSpeed`. Based on the original Intel® Extension for PyTorch\* scripts, the following code changes are required.

Check the [LLM distributed inference examples](https://github.com/intel/intel-extension-for-pytorch/tree/v2.4.0%2Bcpu/examples/cpu/llm/inference/distributed) for complete code.

``` python
import torch
import intel_extension_for_pytorch as ipex
import deepspeed
import transformers

dtype = torch.float # or torch.bfloat16
deepspeed.init_distributed(deepspeed.accelerator.get_accelerator().communication_backend_name())

world_size = ... # get int from env var "WORLD_SIZE" or "PMI_SIZE"
with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=dtype,
    checkpoint=checkpoints_json,
    **kwargs,
)
model = model.module

model = ipex.llm.optimize(model, dtype=dtype)

# inference
...
```
diff --git a/cpu/2.4.0+cpu/_sources/tutorials/performance.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/performance.md.txt new file mode 100644 index 000000000..b3754111b --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/performance.md.txt @@ -0,0 +1,661 @@
Performance
===========

## Overview

This page shows the performance boost with Intel® Extension for PyTorch\* on several popular topologies.

## Performance Data for Intel® AI Data Center Products

Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in this [Intel® Developer Zone article](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/performance.html).

## LLM Performance

We benchmarked LLaMA2 7B, LLaMA2 13B, and GPT-J 6B with test input token lengths of 256 and 1024. The tests were carried out on AWS M7i and M6i instances.
CPUs of M6i instances are 3rd Gen Intel® Xeon® Processors which do not have AMX instructions for BF16 computing acceleration, so we take FP32 precision for benchmarking instead of BF16 on M6i instances. + +![LLaMA2 7B Results](../../images/performance/m7i_m6i_comp_llama7b.png) + +![LLaMA2 13B Results](../../images/performance/m7i_m6i_comp_llama13b.png) + +![GPT-J 6B Results](../../images/performance/m7i_m6i_comp_gptj6b.png) + +The LLM inference performances on M7i and M6i instances are compared based on the above results. M7i, with the 4th Gen Xeon® processors, has a remarkable performance advantage over M6i with the 3rd Gen Xeon® processors. + +M7i performance boost ratio over M6i for non-quantized (BF16 or FP32) models: + +| | Speedup | Throughput | +|:----------:|:-------:|:----------:| +| LLaMA2 7B | 2.47x | 2.62x | +| LLaMA2 13B | 2.57x | 2.62x | +| GPT-J 6B | 2.58x | 2.85x | + +M7i performance boost ratio over M6i for INT8 quantized models: + +| | Speedup | Throughput | +|:----------:|:-------:|:----------:| +| LLaMA2 7B | 1.27x | 1.38x | +| LLaMA2 13B | 1.27x | 1.27x | +| GPT-J 6B | 1.29x | 1.36x | + +We can also conclude that **with a larger batch size the capacity of the model service can be improved at the cost of longer response latency for the individual sessions**. The following table exhibits that for INT8 quantized LLaMA2-7b model on M7i instances, input batch_size=8 would increase the total throughput by 6.47x compared with batch_size=1, whereas P90 token latency gets 1.26x longer. + +| Batch size | Decoder latency | Total tokens per sec | +|:----------:|:---------------:|:--------------------:| +| 1 | 39 | 26.32 | +| 8 | 49 | 170.21 | +| | | | +|***Ratio*** | 1.26x | 6.47x | + +*Note:* Measured by Intel on 17th Aug 2023; M7i.16xLarge, M6i.16xLarge instances in US-west-2. OS-Ubuntu 22.04-lts, kernel 6.20.0-1009-aws, SW: PyTorch* 2.1 and Intel® Extension for PyTorch* 2.1/llm_feature_branch. + +## INT8 with v1.11 + +### Performance Numbers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware | Workload [1] | Precision | Throughput Inference [2]: Batch Size | Throughput Inference [2]: Boost Ratio | Realtime Inference [3]: Batch Size | Realtime Inference [3]: Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-- |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
+ +*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration. + +*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*. + +### Accuracy + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Workload | Metric | FP32 | INT8 | INT8/FP32 |
| :-: | :-: | :-: | :-: | :-: |
| BERT-base_text_classification | f1 | 0.81 | 0.81 | 99.79% |
| BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
| Distilbert-base | f1 | 86.84 | 86.13 | 99.19% |
| ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
| ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
| SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
| VGG11 | Top1 | 69.04 | 67.96 | 98.44% |
| Shufflenetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93% [1] |

1. ShuffleNet INT8 accuracy is expected to improve w/o performance trade-off via histogram calibration algorithm.
+ +### Configuration + +#### Software Version + +| Software | Version | +| :-: | :-: | +| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) | +| Intel® Extension for PyTorch\* | [v1.11.0](https://github.com/intel/intel-extension-for-pytorch/releases) | + +#### Hardware Configuration + +| | 3rd Generation Intel® Xeon® Scalable Processors | +| :-: | :-: | +| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | +| Number of nodes | 1 | +| Number of sockets | 2 | +| Cores/Socket | 40 | +| Threads/Core | 2 | +| uCode | 0xd0002a0 | +| Hyper-Threading | ON | +| TurboBoost | ON | +| BIOS version | 04.12.02 | +| Number of DDR Memory slots | 16 | +| Capacity of DDR memory per slot | 16GB | +| DDR frequency | 3200 | +| Total Memory/Node (DDR+DCPMM) | 256GB | +| Host OS | CentOS Linux release 8.4.2105 | +| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 | +| Docker OS | Ubuntu 18.04.5 LTS | +| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated | + +## FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance + +### Performance Numbers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware | Workload [1] | Precision | Throughput Inference [2]: Batch Size | Throughput Inference [2]: Boost Ratio | Real-time Inference [3]: Batch Size | Real-time Inference [3]: Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-- |
| AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| AWS EC2 C6i.2xlarge | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| AWS EC2 C6i.2xlarge | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 64 |
| AWS EC2 C6i.2xlarge | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 128 |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
+ +*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration. + +*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*. + +### Configuration + +#### Software Version + +| Software | Version | +| :-: | :-: | +| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) | +| Intel® Extension for PyTorch\* | [v1.11.200](https://github.com/intel/intel-extension-for-pytorch/releases) | + +## FP32 and BFloat16 with v1.10 + +### Performance Numbers + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Real-time Inference³ Batch Size | Real-time Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 64 |
| | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 128 |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
+ +*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration. + +*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*. + +### Configuration + +#### Software Version + +| Software | Version | +| :-: | :-: | +| PyTorch | [v1.10.1](https://pytorch.org/get-started/locally/) | +| Intel® Extension for PyTorch\* | [v1.10.100](https://github.com/intel/intel-extension-for-pytorch/releases) | + +#### Hardware Configuration + +| | 3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake | +| :-: | :-: | :-: | +| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz +| +| Number of nodes | 1 | 1 | +| Number of sockets | 2 | 2 | +| Cores/Socket | 40 | 28 | +| Threads/Core | 2 | 2 | +| uCode | 0xd0002a0 | 0x700001c | +| Hyper-Threading | ON | ON | +| TurboBoost | ON | ON | +| BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 | +| Number of DDR Memory slots | 16 | 12 | +| Capacity of DDR memory per slot | 16GB | 64GB | +| DDR frequency | 3200 | 3200 | +| Total Memory/Node (DDR+DCPMM) | 256GB | 768GB | +| Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS | +| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 | 4.15.0-76-generic | +| Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS | +| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated | Mitigated | diff --git a/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/launch_script.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/launch_script.md.txt new file mode 100644 index 000000000..61c5826f3 --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/launch_script.md.txt @@ -0,0 +1,535 @@ +Launch Script Usage Guide +========================= + +## Overview + +As introduced in the [Performance Tuning Guide](tuning_guide.md), there are several factors that influence performance. Setting configuration options properly contributes to a performance boost. However, there is no unified configuration that is optimal to all topologies. Users need to try different combinations by themselves. A *launch* script is provided to automate these configuration settings to free users from this complicated work. This guide helps you to learn some common usage examples that cover many optimized configuration cases. + +The configurations are mainly around the following perspectives. +1. OpenMP library: [**Intel OpenMP library** (default) | GNU OpenMP library] +2. Memory allocator: [PyTorch default memory allocator | Jemalloc | **TCMalloc** (default)] +3. Number of instances: [**Single instance** (default) | Multiple instances] + +## Usage of launch script + +The *launch* script is provided as a module of *intel_extension_for_pytorch*. You can take advantage of it with the following command: + +``` +ipexrun [knobs] [args] +``` + +Available option settings (knobs) are listed below: + +| knob | type | default value | help | +| :-- | :--: | :--: | :-- | +| `-h`, `--help` | - | - | show this help message and exit | +| `-m`, `--module` | - | False | Changes each process to interpret the launch script as a python module, executing with the same behavior as 'python -m'. | +| `--no-python` | - | False | Avoid applying `python` to execute `program`. | +| `--log-dir` | str | '' | The log file directory. Setting it to empty ('') disables logging to files. 
| +| `--log-file-prefix` | str | 'run' | log file name prefix | + +Launcher Common Arguments: + +| knob | type | default value | help | +| :-- | :--: | :--: | :-- | +| `--ncores-per-instance` | int | 0 | Number of cores per instance. It has to be an integer larger than or equal to `-1`. When set to `0`, cores are evenly assigned to each instance. If number of cores cannot be divided by number of instances, residual cores are unused. When set to `-1`, cores are evenly assigned to each instance as much as possible to fully utilize all cores. When set to a number larger than `0`, designated number of cores are assigned to each instance. | +| `--nodes-list` | str | '' | Specify nodes list for multiple instances to run on, in format of list of single node ids "node_id,node_id,..." or list of node ranges "node_id-node_id,...". By default all nodes will be used. | +| `--use-e-cores` | - | False | Use Efficient-Cores on the workloads or not. By default, only Performance-Cores are used. | +| `--memory-allocator` | str | 'auto' | Choose which memory allocator to run the workloads with. Supported choices are ['auto', 'default', 'tcmalloc', 'jemalloc']. | +| `--omp-runtime` | str | 'auto' | Choose which OpenMP runtime to run the workloads with. Supported choices are ['auto', 'default', 'intel']. | +| `--strategy` | str | 'scatter' | Tell how cores are distributed over instances when only part of all cores are needed on a machine with multiple NUMA nodes. Supported choices are ['scatter', 'close']. With 'scatter', instances are distributed evenly as much as possible over all available NUMA nodes. While with 'close', instances are assigned to cores in order continuously. | + +Multi-instance Arguments: + +| knob | type | default value | help | +| :-- | :--: | :--: | :-- | +| `--ninstances` | int | 0 | Number of instances | +| `--instance-idx` | int | -1 | Inside the multi instance list, execute a specific instance at index. If it is set to -1, run all of them. | +| `--use-logical-cores` | - | False | Use logical cores on the workloads or not. By default, only physical cores are used. | +| `--bind-numa-node` | - | False | Bind instances to be executed on cores on a single NUMA node. | +| `--multi-task-manager` | str | 'auto' | Choose which multi task manager to run the workloads with. Supported choices are ['auto', 'none', 'numactl', 'taskset']. | +| `--latency-mode` | - | False | Use 4 cores per instance over all physical cores. | +| `--throughput-mode` | - | False | Run one instance per node with all physical cores. | +| `--cores-list` | str | '' | Specify cores list for multiple instances to run on, in format of list of single core ids "core_id,core_id,..." or list of core ranges "core_id-core_id,...". By default all cores will be used. | +| `--benchmark` | - | False | Enable benchmark config. JeMalloc's MALLOC_CONF has been tuned for low latency. Recommend to use this for benchmarking purpose; for other use cases, this MALLOC_CONF may cause Out-of-Memory crash. | + +Distributed Training Arguments With oneCCL backend: + +| knob | type | default value | help | +| :-- | :--: | :--: | :-- | +| `--nnodes` | int | 0 | Number of machines/devices to use for distributed training | +| `--nprocs-per-node` | int | 0 | Number of processes run on each machine/device. It is by default the number of available nodes when set to `0`. Argument `--nodes-list` affects this default value. 
|
| `--ccl-worker-count` | int | 4 | Number of cores per rank for ccl communication |
| `--logical-cores-for-ccl` | - | False | Use logical cores for the ccl worker. |
| `--master-addr` | str | 127.0.0.1 | Address of master node (rank 0), should be either the IP address or the hostname of node 0. For single node multi-proc training, the `--master-addr` can simply be 127.0.0.1. |
| `--master-port` | int | 29500 | Port on master node (rank 0) for communication during distributed training. |
| `--hostfile` | str | 'hostfile' | Set the hostfile for multi-node multi-proc training. The hostfile includes a node address list containing either IP addresses or hostnames of computation nodes. |
| `--extra-mpi-params` | str | '' | Extra parameters for mpiexec.hydra except for -np -ppn -hostfile and -genv I_MPI_PIN_DOMAIN |

[Codeless Optimization feature](../features/codeless_optimization.md) related option settings (knobs) are listed below:

| knob | type | default value | help |
| :-- | :--: | :--: | :-- |
| `--auto-ipex` | - | False | Automatically enable the ipex optimization feature |
| `--dtype` | string | False | data type, can choose from ['float32', 'bfloat16'] |
| `--auto-ipex-verbose` | - | False | This flag is only used for debug and UT of auto ipex. |
| `--disable-ipex-graph-mode` | - | False | Disable the Graph Mode for the `ipex.optimize()` function |

**Note:** `--latency-mode` and `--throughput-mode` take precedence over `--ninstances`, `--ncores-per-instance` and `--use-logical-cores`; i.e., setting either of `--latency-mode` or `--throughput-mode` overwrites `--ninstances`, `--ncores-per-instance` and `--use-logical-cores` if they are explicitly set on the command line. `--latency-mode` and `--throughput-mode` are themselves mutually exclusive.

The *launch* script respects existing environment variables when it gets launched, except for *LD_PRELOAD*. If you have your favorite values for certain environment variables, you can set them before running the *launch* script. The Intel OpenMP library uses the environment variable *KMP_AFFINITY* to control its behavior; different settings result in different performance numbers. By default, if you enable the Intel OpenMP library, the *launch* script sets *KMP_AFFINITY* to `granularity=fine,compact,1,0`. If you want to try other values, you can use the `export` command on Linux to set *KMP_AFFINITY* before you run the *launch* script. In this case, the script will not set the default value but take the existing value of *KMP_AFFINITY*, and print a message to stdout.

Execution via the *launch* script can dump logs into files under a designated log directory so you can do some investigation afterward. By default, logging is disabled to avoid undesired log files. You can enable logging by setting the `--log-dir` knob to a directory (an absolute or relative path) in which to store log files. Two types of log files are generated: one file (`_timestamp_instances.log`) contains the command and information from when the script was launched; the other type (`_timestamp_instance_#_core#-core#....log`) contains the stdout output of each instance.

For example:

```
run_20210712212258_instances.log
run_20210712212258_instance_0_cores_0-43.log
```

## Usage Examples

Example script [resnet50.py](https://github.com/intel/intel-extension-for-pytorch/blob/v2.0.100%2Bcpu/examples/cpu/inference/python/resnet50_general_inference_script.py) will be used in this guide.

- Single instance for inference
  - [I. 
Use all physical cores](#i-use-all-physical-cores) + - [II. Use all cores including logical cores](#ii-use-all-cores-including-logical-cores) + - [III. Use physical cores on 1 node](#iii-use-physical-cores-on-1-node) + - [IV. Use your designated number of cores](#iv-use-your-designated-number-of-cores) +- Multiple instances for inference + - [V. Throughput mode (i.e. number of numa node instances, each instance runs on 1 numa node)](#v-throughput-mode) + - [VI. Latency mode (Use 4 cores for each instance)](#vi-latency-mode) + - [VII. Your designated number of instances](#vii-your-designated-number-of-instances) + - [VIII. Your designated number of instances and instance index](#viii-your-designated-number-of-instances-and-instance-index) +- Usage of Jemalloc/TCMalloc/Default memory allocator + - [Jemalloc](#jemalloc) + - [TCMalloc](#tcmalloc) + - [Default memory allocator](#default-memory-allocator) +- Usage of GNU OpenMP library + - [Intel OpenMP library](#intel-openmp-library) + - [GNU OpenMP library](#gnu-openmp-library) + +__Note:__ GIF files below illustrate CPU usage ONLY. Do NOT infer performance numbers. + +### Single instance for inference + +#### I. Use all physical cores + +``` +ipexrun --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 1 main worker thread was launched, then it launched physical core number of threads on all physical cores. + +![Single instance all physical cores](../../../images/launch_script/1ins_phy.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. +├── resnet50.py +└── logs + ├── run_20210712212258_instance_0_cores_0-43.log + └── run_20210712212258_instances.log +``` + +The `run_20210712212258_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712212258_instances.log +2021-07-12 21:22:58,764 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 21:22:58,764 - __main__ - INFO - OMP_NUM_THREADS=44 +2021-07-12 21:22:58,764 - __main__ - INFO - Using Intel OpenMP +2021-07-12 21:22:58,764 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 21:22:58,764 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 21:22:58,764 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 21:22:58,764 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes +2021-07-12 21:22:58,764 - __main__ - INFO - numactl -C 0-43 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712212258_instance_0_cores_0-43.log +``` + +#### II. Use all cores including logical cores + +``` +ipexrun --use-logical-core --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all cores, including logical cores. + +![Single instance logical cores](../../../images/launch_script/1ins_log.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. 
+├── resnet50.py +└── logs + ├── run_20210712223308_instances.log + └── run_20210712223308_instance_0_cores_0-87.log +``` + +The `run_20210712223308_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712223308_instances.log +2021-07-12 22:33:08,117 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 22:33:08,117 - __main__ - INFO - OMP_NUM_THREADS=88 +2021-07-12 22:33:08,117 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:33:08,118 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:33:08,118 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:33:08,118 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:33:08,118 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87'] on different NUMA nodes +2021-07-12 22:33:08,118 - __main__ - INFO - numactl -C 0-87 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712223308_instance_0_cores_0-87.log +``` + +#### III. Use physical cores on designated nodes + +``` +ipexrun --nodes-list 1 --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all other cores on the same numa node. + +![Single instance all physical cores](../../../images/launch_script/1ins_soc.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. +├── resnet50.py +└── logs + ├── run_20210712214504_instances.log + └── run_20210712214504_instance_0_cores_22-43.log + +``` + +The `run_20210712214504_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712214504_instances.log +2021-07-12 21:45:04,512 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 21:45:04,513 - __main__ - INFO - OMP_NUM_THREADS=22 +2021-07-12 21:45:04,513 - __main__ - INFO - Using Intel OpenMP +2021-07-12 21:45:04,513 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 21:45:04,513 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 21:45:04,513 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 21:45:04,513 - __main__ - INFO - numactl -C 22-43 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712214504_instance_0_cores_22-43.log +``` + +#### IV. Use your designated number of cores + +``` +ipexrun --ninstances 1 --ncores-per-instance 10 --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on other 9 physical cores. 
+ +![Single instance designated number of cores](../../../images/launch_script/1ins_cus.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. +├── resnet50.py +└── logs + ├── run_20210712220928_instances.log + └── run_20210712220928_instance_0_cores_0-9.log +``` + +The `run_20210712220928_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712220928_instances.log +2021-07-12 22:09:28,355 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 22:09:28,355 - __main__ - INFO - OMP_NUM_THREADS=10 +2021-07-12 22:09:28,355 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:09:28,355 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:09:28,356 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:09:28,356 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:09:28,356 - __main__ - INFO - numactl -C 0-9 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712220928_instance_0_cores_0-9.log +``` + +You can also specify the cores to be utilized using `--cores-list` argument. For example, if core id 11-20 are desired instead of the first 10 cores, the launch command would be as below. + +``` +ipexrun --ncores-per-instance 10 --cores-list "11-20" --log-dir ./logs resnet50.py +``` + +Please notice that when specifying `--cores-list`, a correspondant `--ncores-per-instance` argument is required for instance number deduction. + +In this case the log directory should be like +``` +. +├── resnet50.py +└── logs + ├── run_20210712221615_instances.log + └── run_20210712221615_instance_0_cores_11-20.log +``` + +The `run_20210712221615_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712221615_instances.log +2021-07-12 22:16:15,591 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 22:16:15,591 - __main__ - INFO - OMP_NUM_THREADS=10 +2021-07-12 22:16:15,591 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:16:15,591 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:16:15,591 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:16:15,591 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:16:15,591 - __main__ - INFO - numactl -C 11-20 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221615_instance_0_cores_11-20.log +``` + +### Multiple instances for inference + +#### V. Throughput mode + +``` +ipexrun --throughput-mode --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 2 main worker threads were launched on 2 numa nodes respectively, then they launched threads on other physical cores. + +![Multiple instance throughput mode](../../../images/launch_script/nins_thr.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. 
+├── resnet50.py +└── logs + ├── run_20210712221150_instances.log + ├── run_20210712221150_instance_0_cores_0-21.log + └── run_20210712221150_instance_1_cores_22-43.log +``` + +The `run_20210712221150_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712221150_instances.log +2021-07-12 22:11:50,233 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-12 22:11:50,233 - __main__ - INFO - OMP_NUM_THREADS=22 +2021-07-12 22:11:50,233 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:11:50,233 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:11:50,233 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:11:50,233 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:11:50,233 - __main__ - INFO - numactl -C 0-21 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221150_instance_0_cores_0-21.log +2021-07-12 22:11:50,236 - __main__ - INFO - numactl -C 22-43 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221150_instance_1_cores_22-43.log +``` + +#### VI. Latency mode + +``` +ipexrun --latency-mode --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 4 cores are used for each instance. + +![Multiple instances latency mode](../../../images/launch_script/nins_lat.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. +├── resnet50.py +└── logs + ├── run_20210712221415_instances.log + ├── run_20210712221415_instance_0_cores_0-3.log + ├── run_20210712221415_instance_1_cores_4-7.log + ├── run_20210712221415_instance_2_cores_8-11.log + ├── run_20210712221415_instance_3_cores_12-15.log + ├── run_20210712221415_instance_4_cores_16-19.log + ├── run_20210712221415_instance_5_cores_20-23.log + ├── run_20210712221415_instance_6_cores_24-27.log + ├── run_20210712221415_instance_7_cores_28-31.log + ├── run_20210712221415_instance_8_cores_32-35.log + ├── run_20210712221415_instance_9_cores_36-39.log + └── run_20210712221415_instance_10_cores_40-43.log +``` + +The `run_20210712221415_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712221415_instances.log +2021-07-12 22:14:15,140 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. 
This may drop the performance +2021-07-12 22:14:15,140 - __main__ - INFO - OMP_NUM_THREADS=4 +2021-07-12 22:14:15,140 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:14:15,140 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:14:15,140 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:14:15,140 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:14:15,140 - __main__ - INFO - numactl -C 0-3 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_0_cores_0-3.log +2021-07-12 22:14:15,143 - __main__ - INFO - numactl -C 4-7 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_1_cores_4-7.log +2021-07-12 22:14:15,146 - __main__ - INFO - numactl -C 8-11 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_2_cores_8-11.log +2021-07-12 22:14:15,149 - __main__ - INFO - numactl -C 12-15 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_3_cores_12-15.log +2021-07-12 22:14:15,151 - __main__ - INFO - numactl -C 16-19 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_4_cores_16-19.log +2021-07-12 22:14:15,154 - __main__ - WARNING - Numa Aware: cores:['20', '21', '22', '23'] on different NUMA nodes +2021-07-12 22:14:15,154 - __main__ - INFO - numactl -C 20-23 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_5_cores_20-23.log +2021-07-12 22:14:15,157 - __main__ - INFO - numactl -C 24-27 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_6_cores_24-27.log +2021-07-12 22:14:15,159 - __main__ - INFO - numactl -C 28-31 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_7_cores_28-31.log +2021-07-12 22:14:15,162 - __main__ - INFO - numactl -C 32-35 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_8_cores_32-35.log +2021-07-12 22:14:15,164 - __main__ - INFO - numactl -C 36-39 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_9_cores_36-39.log +2021-07-12 22:14:15,167 - __main__ - INFO - numactl -C 40-43 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_10_cores_40-43.log +``` + +#### VII. Your designated number of instances + +``` +ipexrun --ninstances 4 --log-dir ./logs resnet50.py +``` + +CPU usage is shown as below. 4 main worker thread were launched, then they launched threads on all other physical cores. + +![Multiple instances designated number of instances](../../../images/launch_script/nins_cus.gif) + +If you check your log directory, you will find directory structure as below. + +``` +. +├── resnet50.py +└── logs + ├── run_20210712221305_instances.log + ├── run_20210712221305_instance_0_cores_0-10.log + ├── run_20210712221305_instance_1_cores_11-21.log + ├── run_20210712221305_instance_2_cores_22-32.log + └── run_20210712221305_instance_3_cores_33-43.log +``` + +The `run_20210712221305_instances.log` contains information and command that were used for this execution launch. + +``` +$ cat logs/run_20210712221305_instances.log +2021-07-12 22:13:05,470 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. 
This may drop the performance +2021-07-12 22:13:05,470 - __main__ - INFO - OMP_NUM_THREADS=11 +2021-07-12 22:13:05,470 - __main__ - INFO - Using Intel OpenMP +2021-07-12 22:13:05,470 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-12 22:13:05,470 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-12 22:13:05,470 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-12 22:13:05,471 - __main__ - INFO - numactl -C 0-10 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_0_cores_0-10.log +2021-07-12 22:13:05,473 - __main__ - INFO - numactl -C 11-21 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_1_cores_11-21.log +2021-07-12 22:13:05,476 - __main__ - INFO - numactl -C 22-32 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_2_cores_22-32.log +2021-07-12 22:13:05,479 - __main__ - INFO - numactl -C 33-43 -m 1 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_3_cores_33-43.log +``` + +#### VIII. Your designated number of instances and instance index + +Launcher by default runs all `ninstances` for multi-instance inference/training as shown above. You can specify `instance_idx` to independently run that instance only among `ninstances` + +``` +ipexrun --ninstances 4 --instance-idx 0 --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2022-01-06 13:01:51,175 - __main__ - INFO - OMP_NUM_THREADS=14 +2022-01-06 13:01:51,176 - __main__ - INFO - Using Intel OpenMP +2022-01-06 13:01:51,177 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2022-01-06 13:01:51,177 - __main__ - INFO - KMP_BLOCKTIME=1 +2022-01-06 13:01:51,177 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2022-01-06 13:01:51,177 - __main__ - INFO - numactl -C 0-10 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20220106130151_instance_0_cores_0-13.log +``` + +``` +ipexrun --ninstances 4 --instance-idx 1 --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2022-01-06 13:01:51,175 - __main__ - INFO - OMP_NUM_THREADS=14 +2022-01-06 13:01:51,176 - __main__ - INFO - Using Intel OpenMP +2022-01-06 13:01:51,177 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2022-01-06 13:01:51,177 - __main__ - INFO - KMP_BLOCKTIME=1 +2022-01-06 13:01:51,177 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2022-01-06 13:01:51,177 - __main__ - INFO - numactl -C 11-21 -m 0 /bin/python resnet50.py 2>&1 | tee ./logs/run_20220106130151_instance_0_cores_0-13.log +``` + +### Usage of Jemalloc/TCMalloc/Default memory allocator + +Memory allocator influences performance sometime. If users do not designate desired memory allocator, the *launch* script searches them in the order of TCMalloc > Jemalloc > PyTorch default memory allocator, and takes the first matched one. + +#### Jemalloc + +__Note:__ You can set your favorite value to *MALLOC_CONF* before running the *launch* script if you do not want to use its default setting. 
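For example, a custom configuration can be exported before invoking the launcher; the value below is purely illustrative (any valid jemalloc configuration string works) and will be kept by the script instead of its default. Then launch with `--memory-allocator jemalloc` as shown below.

```
# Illustrative only: provide a custom jemalloc configuration up front.
# The launch script keeps an existing MALLOC_CONF instead of applying its default.
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto"
```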
+ +``` +ipexrun --memory-allocator jemalloc --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2021-07-13 15:30:48,235 - __main__ - INFO - Use JeMallocl memory allocator +2021-07-13 15:30:48,235 - __main__ - INFO - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 +2021-07-13 15:30:48,235 - __main__ - INFO - OMP_NUM_THREADS=44 +2021-07-13 15:30:48,235 - __main__ - INFO - Using Intel OpenMP +2021-07-13 15:30:48,235 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-13 15:30:48,235 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-13 15:30:48,235 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libjemalloc.so +2021-07-13 15:30:48,236 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes +2021-07-13 15:30:48,236 - __main__ - INFO - numactl -C 0-43 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153048_instance_0_cores_0-43.log +``` + +#### TCMalloc + +``` +ipexrun --memory-allocator tcmalloc --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2021-07-13 15:33:33,654 - __main__ - INFO - Use TCMalloc memory allocator +2021-07-13 15:33:33,654 - __main__ - INFO - OMP_NUM_THREADS=44 +2021-07-13 15:33:33,654 - __main__ - INFO - Using Intel OpenMP +2021-07-13 15:33:33,654 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-13 15:33:33,654 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-13 15:33:33,654 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so +2021-07-13 15:33:33,654 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes +2021-07-13 15:33:33,655 - __main__ - INFO - numactl -C 0-43 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153333_instance_0_cores_0-43.log +``` + +#### Default memory allocator + +``` +ipexrun --memory-allocator default --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2021-07-13 15:36:59,784 - __main__ - INFO - OMP_NUM_THREADS=44 +2021-07-13 15:36:59,784 - __main__ - INFO - Using Intel OpenMP +2021-07-13 15:36:59,784 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-07-13 15:36:59,784 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-07-13 15:36:59,784 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-07-13 15:36:59,784 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes +2021-07-13 15:36:59,784 - __main__ - INFO - numactl -C 0-43 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153659_instance_0_cores_0-43.log +``` + +### Usage of OpenMP library + +#### Intel OpenMP Library + +Generally, Intel OpenMP library brings better performance. Thus, in the *launch* script, Intel OpenMP library is used by default, if it is available. 
Intel OpenMP library takes environment variables like *KMP_AFFINITY* and *KMP_BLOCKTIME* to control its behavior. You can set your favorite values to them before running the *launch* script if you do not want to use the default settings. + +#### GNU OpenMP Library + +It is, however, not always that Intel OpenMP library brings better performance comparing to GNU OpenMP library. In this case, you can use knob `--omp-runtime default` to switch active OpenMP library to the GNU one. GNU OpenMP specific environment variables, *OMP_SCHEDULE* and *OMP_PROC_BIND*, for setting CPU affinity are set automatically. + +``` +ipexrun --omp-runtime default --log-dir ./logs resnet50.py +``` + +you can confirm usage in log file: + +``` +2021-07-13 15:25:00,760 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-07-13 15:25:00,761 - __main__ - INFO - OMP_SCHEDULE=STATIC +2021-07-13 15:25:00,761 - __main__ - INFO - OMP_PROC_BIND=CLOSE +2021-07-13 15:25:00,761 - __main__ - INFO - OMP_NUM_THREADS=44 +2021-07-13 15:25:00,761 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes +2021-07-13 15:25:00,761 - __main__ - INFO - numactl -C 0-43 /bin/python resnet50.py 2>&1 | tee ./logs/run_20210713152500_instance_0_cores_0-43.log +``` diff --git a/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/torchserve.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/torchserve.md.txt new file mode 100644 index 000000000..a5d8d694d --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/torchserve.md.txt @@ -0,0 +1,322 @@ +# TorchServe with Intel® Extension for PyTorch\* + +TorchServe can be used with Intel® Extension for PyTorch\* to give performance boost on Intel hardware.1 +Here we show how to use TorchServe with Intel® Extension for PyTorch\*. + +1. While Intel® Extension for PyTorch\* benefits all platforms, platforms with AVX512 benefit the most. + +## Contents of this Document +* [Install Intel® Extension for PyTorch\*](#install-intel-extension-for-pytorch) +* [Serving model with Intel® Extension for PyTorch\*](#serving-model-with-intel-extension-for-pytorch) +* [TorchServe with Launcher](#torchserve-with-launcher) +* [Creating and Exporting INT8 model for Intel® Extension for PyTorch\*](#creating-and-exporting-int8-model-for-intel-extension-for-pytorch) +* [Benchmarking with Launcher](#benchmarking-with-launcher) +* [Performance Boost with Intel® Extension for PyTorch\* and Launcher](#performance-boost-with-intel-extension-for-pytorch-and-launcher) + + +## Install Intel® Extension for PyTorch\* +Refer to the documentation [here](../installation.md). + +## Serving model with Intel® Extension for PyTorch\* +After installation, all it needs to use TorchServe with Intel® Extension for PyTorch\* is to enable it in `config.properties`. +``` +ipex_enable=true +``` +Once Intel® Extension for PyTorch\* is enabled, deploying PyTorch model follows the same procedure shown [here](https://pytorch.org/serve/use_cases.html). 
TorchServe with Intel® Extension for PyTorch\* can deploy any model and do inference. + +## TorchServe with Launcher +Launcher is a script to automate the process of tunining configuration setting on Intel hardware to boost performance. Tuning configurations such as OMP_NUM_THREADS, thread affinity, memory allocator can have a dramatic effect on performance. Refer to [Performance Tuning Guide](./tuning_guide.md) and [Launch Script Usage Guide](./launch_script.md) for details on performance tuning with launcher. + +All it needs to use TorchServe with launcher is to set its configuration in `config.properties`. + +Add the following lines in `config.properties` to use launcher with its default configuration. +``` +ipex_enable=true +cpu_launcher_enable=true +``` + +Launcher by default uses `numactl` if it's installed to ensure socket is pinned and thus memory is allocated from local numa node. To use launcher without numactl, add the following lines in `config.properties`. +``` +ipex_enable=true +cpu_launcher_enable=true +cpu_launcher_args=--disable_numactl +``` + +Launcher by default uses only non-hyperthreaded cores if hyperthreading is present to avoid core compute resource sharing. To use launcher with all cores, both physical and logical, add the following lines in `config.properties`. +``` +ipex_enable=true +cpu_launcher_enable=true +cpu_launcher_args=--use_logical_core +``` + +Below is an example of passing multiple args to `cpu_launcher_args`. +``` +ipex_enable=true +cpu_launcher_enable=true +cpu_launcher_args=--use_logical_core --disable_numactl +``` + +Below are some useful `cpu_launcher_args` to note. Italic values are default if applicable. +1. Memory Allocator: [ PTMalloc `--use_default_allocator` | *TCMalloc `--enable_tcmalloc`* | JeMalloc `--enable_jemalloc`] + * PyTorch by default uses PTMalloc. TCMalloc/JeMalloc generally gives better performance. +2. OpenMP library: [GNU OpenMP `--disable_iomp` | *Intel OpenMP*] + * PyTorch by default uses GNU OpenMP. Launcher by default uses Intel OpenMP. Intel OpenMP library generally gives better performance. +3. Node id: [`--node_id`] + * Launcher by default uses all NUMA nodes. Limit memory access to local memories on the Nth Numa node to avoid Non-Uniform Memory Access (NUMA). + +Refer to [Launch Script Usage Guide](./launch_script.md) for a full list of tunable configuration of launcher. And refer to [Performance Tuning Guide](./tuning_guide.md) for more details. + +### Launcher Core Pinning to Boost Performance of TorchServe Multi Worker Inference +When running [multi-worker inference](https://pytorch.org/serve/management_api.html#scale-workers) with Torchserve (Required torchserve>=0.6.1), launcher pin cores to workers to boost performance. Internally, launcher equally divides the number of cores by the number of workers such that each worker is pinned to assigned cores. Doing so avoids core overlap among workers which can signficantly boost performance for TorchServe multi-worker inference. For example, assume running 4 workers on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core. Launcher will bind worker 0 to cores 0-13, worker 1 to cores 14-27, worker 2 to cores 28-41, and worker 3 to cores 42-55. + +CPU usage is shown below. 4 main worker threads were launched, each launching 14 threads affinitized to the assigned physical cores. 
+![26](https://user-images.githubusercontent.com/93151422/170373651-fd8a0363-febf-4528-bbae-e1ddef119358.gif) + + +#### Scaling workers +Additionally when dynamically [scaling the number of workers](https://pytorch.org/serve/management_api.html#scale-workers), cores that were pinned to killed workers by the launcher could be left unutilized. To address this problem, launcher internally restarts the workers to re-distribute cores that were pinned to killed workers to the remaining, alive workers. This is taken care internally, so users do not have to worry about this. + +Continuing with the above example with 4 workers, assume killing workers 2 and 3. If cores were not re-distributed after the scale down, cores 28-55 would be left unutilized. Instead, launcher re-distributes cores 28-55 to workers 0 and 1 such that now worker 0 binds to cores 0-27 and worker 1 binds to cores 28-55.2 + +CPU usage is shown below. 4 main worker threads were initially launched. Then after scaling down the number of workers from 4 to 2, 2 main worker threads were launched, each launching 28 threads affinitized to the assigned physical cores. +![worker_scaling](https://user-images.githubusercontent.com/93151422/170374697-7497c2d5-4c17-421b-9993-1434d1f722f6.gif) + +2. Serving is interrupted for few seconds while re-distributing cores to scaled workers. + +Again, all it needs to use TorchServe with launcher core pinning for multiple workers as well as scaling workers is to set its configuration in `config.properties`. + +Add the following lines in `config.properties` to use launcher with its default configuration. +``` +cpu_launcher_enable=true +``` + +## Creating and Exporting INT8 model for Intel® Extension for PyTorch\* +Intel® Extension for PyTorch\* supports both eager and torchscript mode. In this section, we show how to deploy INT8 model for Intel® Extension for PyTorch\*. Refer to [here](../features/int8_overview.md) for more details on Intel® Extension for PyTorch\* optimizations for quantization. + +### 1. Creating a serialized file +First create `.pt` serialized file using Intel® Extension for PyTorch\* INT8 inference. Here we show two examples with BERT and ResNet50. 
+ +#### BERT + +``` +import torch +import intel_extension_for_pytorch as ipex +from transformers import BertModel + +# load the model +model = BertModel.from_pretrained('bert-base-uncased') +model = model.eval() + +# define dummy input tensor to use for the model's forward call to record operations in the model for tracing +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 384 +dummy_tensor = torch.randint(vocab_size, size=[batch_size, seq_length]) + +from intel_extension_for_pytorch.quantization import prepare, convert + +# ipex supports two quantization schemes: static and dynamic +# default dynamic qconfig +qconfig = ipex.quantization.default_dynamic_qconfig + +# prepare and calibrate +model = prepare(model, qconfig, example_inputs=dummy_tensor) + +# convert and deploy +model = convert(model) + +with torch.no_grad(): + model = torch.jit.trace(model, dummy_tensor, check_trace=False, strict=False) + model = torch.jit.freeze(model) + +torch.jit.save(model, 'bert_int8_jit.pt') +``` + +#### ResNet50 + +``` +import torch +import intel_extension_for_pytorch as ipex +import torchvision.models as models + +# load the model +model = models.resnet50(pretrained=True) +model = model.eval() + +# define dummy input tensor to use for the model's forward call to record operations in the model for tracing +N, C, H, W = 1, 3, 224, 224 +dummy_tensor = torch.randn(N, C, H, W) + +from intel_extension_for_pytorch.quantization import prepare, convert + +# ipex supports two quantization schemes: static and dynamic +# default static qconfig +qconfig = ipex.quantization.default_static_qconfig + +# prepare and calibrate +model = prepare(model, qconfig, example_inputs=dummy_tensor, inplace=False) + +n_iter = 100 +for i in range(n_iter): + model(dummy_tensor) + +# convert and deploy +model = convert(model) + +with torch.no_grad(): + model = torch.jit.trace(model, dummy_tensor) + model = torch.jit.freeze(model) + +torch.jit.save(model, 'rn50_int8_jit.pt') +``` + +### 2. Creating a Model Archive +Once the serialized file ( `.pt`) is created, it can be used with `torch-model-archiver` as ususal. + +Use the following command to package `rn50_int8_jit.pt` into `rn50_ipex_int8.mar`. +``` +torch-model-archiver --model-name rn50_ipex_int8 --version 1.0 --serialized-file rn50_int8_jit.pt --handler image_classifier +``` +Similarly, use the following command in the [Huggingface_Transformers directory](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers) to package `bert_int8_jit.pt` into `bert_ipex_int8.mar`. + +``` +torch-model-archiver --model-name bert_ipex_int8 --version 1.0 --serialized-file bert_int8_jit.pt --handler ./Transformer_handler_generalized.py --extra-files "./setup_config.json,./Seq_classification_artifacts/index_to_name.json" +``` + +### 3. Start TorchServe to serve the model +Make sure to set `ipex_enable=true` in `config.properties`. Use the following command to start TorchServe with Intel® Extension for PyTorch\*. +``` +torchserve --start --ncs --model-store model_store --ts-config config.properties +``` + +### 4. Registering and Deploying model +Registering and deploying the model follows the same steps shown [here](https://pytorch.org/serve/use_cases.html). + +## Benchmarking with Launcher +Launcher can be used with TorchServe official [benchmark](https://github.com/pytorch/serve/tree/master/benchmarks) to launch server and benchmark requests with optimal configuration on Intel hardware. 
+ +In this section we provide examples of benchmarking with launcher with its default configuration. + +Add the following lines to `config.properties` in the benchmark directory to use launcher with its default setting. +``` +ipex_enable=true +cpu_launcher_enable=true +``` + +The rest of the steps for benchmarking follows the same steps shown [here](https://github.com/pytorch/serve/tree/master/benchmarks). + +`model_log.log` contains information and command that were used for this execution launch. + + +CPU usage on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core is shown as below: +![launcher_default_2sockets](https://user-images.githubusercontent.com/93151422/144373537-07787510-039d-44c4-8cfd-6afeeb64ac78.gif) + +``` +$ cat logs/model_log.log +2021-12-01 21:22:40,096 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-12-01 21:22:40,096 - __main__ - INFO - OMP_NUM_THREADS=56 +2021-12-01 21:22:40,096 - __main__ - INFO - Using Intel OpenMP +2021-12-01 21:22:40,096 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-12-01 21:22:40,096 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-12-01 21:22:40,096 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so +2021-12-01 21:22:40,096 - __main__ - WARNING - Numa Aware: cores:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55] in different NUMA node +``` + +CPU usage on a machine with Intel(R) Xeon(R) Platinum 8375C CPU, 1 socket, 2 cores per socket, 2 threads per socket is shown as below: +![launcher_default_1socket](https://user-images.githubusercontent.com/93151422/144372993-92b2ca96-f309-41e2-a5c8-bf2143815c93.gif) + +``` +$ cat logs/model_log.log +2021-12-02 06:15:03,981 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home//.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance +2021-12-02 06:15:03,981 - __main__ - INFO - OMP_NUM_THREADS=2 +2021-12-02 06:15:03,982 - __main__ - INFO - Using Intel OpenMP +2021-12-02 06:15:03,982 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0 +2021-12-02 06:15:03,982 - __main__ - INFO - KMP_BLOCKTIME=1 +2021-12-02 06:15:03,982 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so + +``` + +### Benchmarking with Launcher Core Pinning +As described previously in [TorchServe with Launcher](#torchserve-with-launcher), launcher core pinning boosts performance of multi-worker inference. We'll demonstrate launcher core pinning with TorchServe benchmark, but keep in mind that launcher core pinning is a generic feature applicable to any TorchServe multi-worker inference use casese. + +For example, assume running 4 workers +``` +python benchmark-ab.py --workers 4 +``` +on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core. Launcher will bind worker 0 to cores 0-13, worker 1 to cores 14-27, worker 2 to cores 28-41, and worker 3 to cores 42-55. 
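The binding above is simply an even split of the physical cores across workers. The shell sketch below only illustrates that arithmetic (56 physical cores and 4 workers are assumed values); the launcher performs this partitioning internally.

```
# Illustration only: evenly split the physical cores (assumed 56) across 4 workers,
# mirroring the worker-to-core binding described above.
NUM_CORES=56
NUM_WORKERS=4
CORES_PER_WORKER=$((NUM_CORES / NUM_WORKERS))
for w in $(seq 0 $((NUM_WORKERS - 1))); do
  start=$((w * CORES_PER_WORKER))
  end=$((start + CORES_PER_WORKER - 1))
  echo "worker ${w} -> cores ${start}-${end}"
done
```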
+ +All it needs to use TorchServe with launcher's core pinning is to enable launcher in `config.properties`. + +Add the following lines to `config.properties` in the benchmark directory to use launcher's core pinning: +``` +cpu_launcher_enable=true +``` + +CPU usage is shown as below: +![launcher_core_pinning](https://user-images.githubusercontent.com/93151422/159063975-e7e8d4b0-e083-4733-bdb6-4d92bdc10556.gif) + +4 main worker threads were launched, then each launched a num_physical_cores/num_workers number (14) of threads affinitized to the assigned physical cores. + +

+$ cat logs/model_log.log
+2022-03-24 10:41:32,223 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:41:32,223 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:41:32,223 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:41:32,223 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:41:32,223 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:41:32,223 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:41:32,223 - __main__ - INFO - numactl -C 0-13 -m 0 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9000
+
+2022-03-24 10:49:03,760 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:03,761 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:03,762 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:03,762 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:03,762 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:03,762 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:03,763 - __main__ - INFO - numactl -C 14-27 -m 0 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9001
+
+2022-03-24 10:49:26,274 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:26,274 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:26,274 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:26,274 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:26,274 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:26,274 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:26,274 - __main__ - INFO - numactl -C 28-41 -m 1 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9002
+
+2022-03-24 10:49:42,975 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:42,975 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:42,975 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:42,975 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:42,975 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:42,975 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:42,975 - __main__ - INFO - numactl -C 42-55 -m 1 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9003
+
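To double-check the pinning on a running server, one option (a hypothetical sanity check, not part of the TorchServe benchmark scripts) is to print the CPU affinity of each worker process; each worker should report a disjoint core range matching the log above.

```
# Hypothetical sanity check: print the CPU affinity list of every TorchServe worker process.
for pid in $(pgrep -f model_service_worker.py); do
  taskset -cp "${pid}"
done
```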
+ +## Performance Boost with Intel® Extension for PyTorch\* and Launcher + +![pdt_perf](https://user-images.githubusercontent.com/93151422/159067306-dfd604e3-8c66-4365-91ae-c99f68d972d5.png) + + +Above shows performance improvement of Torchserve with Intel® Extension for PyTorch\* and launcher on ResNet50 and BERT-base-uncased. Torchserve official [apache-bench benchmark](https://github.com/pytorch/serve/tree/master/benchmarks#benchmarking-with-apache-bench) on Amazon EC2 m6i.24xlarge was used to collect the results2. Add the following lines in ```config.properties``` to reproduce the results. Notice that launcher is configured such that a single instance uses all physical cores on a single socket to avoid cross socket communication and core overlap. + +``` +ipex_enable=true +cpu_launcher_enable=true +cpu_launcher_args=--node_id 0 --enable_jemalloc +``` +Use the following command to reproduce the results. +``` +python benchmark-ab.py --url {modelUrl} --input {inputPath} --concurrency 1 +``` + +For example, run the following command to reproduce latency performance of ResNet50 with data type of Intel® Extension for PyTorch\* int8 and batch size of 1. Refer to [Creating and Exporting INT8 model for Intel® Extension for PyTorch\*](#creating-and-exporting-int8-model-for-intel-extension-for-pytorch) for steps to creating ```rn50_ipex_int8.mar``` file for ResNet50 with Intel® Extension for PyTorch\* int8 data type. +``` +python benchmark-ab.py --url 'file:///model_store/rn50_ipex_int8.mar' --concurrency 1 +``` + +For example, run the following command to reproduce latency performance of BERT with data type of Intel® Extension for PyTorch\* int8 and batch size of 1. Refer to [Creating and Exporting INT8 model for Intel® Extension for PyTorch\*](#creating-and-exporting-int8-model-for-intel-extension-for-pytorch) for steps to creating ```bert_ipex_int8.mar``` file for BERT with Intel® Extension for PyTorch\* int8 data type. +``` +python benchmark-ab.py --url 'file:///model_store/bert_ipex_int8.mar' --input '../examples/Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt' --concurrency 1 +``` + +3. Amazon EC2 m6i.24xlarge was used for benchmarking purpose only. For multi-core instances, Intel® Extension for PyTorch\* optimizations automatically scale and leverage full instance resources. diff --git a/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/tuning_guide.md.txt b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/tuning_guide.md.txt new file mode 100644 index 000000000..b178106bf --- /dev/null +++ b/cpu/2.4.0+cpu/_sources/tutorials/performance_tuning/tuning_guide.md.txt @@ -0,0 +1,264 @@ +Performance Tuning Guide +======================== + +## Overview + +Intel® Extension for PyTorch\* is a Python package to extend official PyTorch. It makes the out-of-box user experience of PyTorch CPU better while achieving good performance. To fully utilize the power of Intel® architecture and thus yield high performance, PyTorch, as well as Intel® Extension for PyTorch\*, are powered by [oneAPI Deep Neural Network Library (oneDNN)](https://github.com/oneapi-src/oneDNN), an open-source cross-platform performance library of basic building blocks for deep learning applications. It is developed and optimized for Intel Architecture Processors, Intel Processor Graphics, and Xe architecture-based Graphics. + +Although default primitives of PyTorch and Intel® Extension for PyTorch\* are highly optimized, there are things users can do improve performance. 
Most optimized configurations can be automatically set by the launcher script. This article introduces common methods recommended by Intel developers.

## Contents of this Document
* [Hardware Configuration](#hardware-configuration)
  * [Intel CPU Structure](#intel-cpu-structure)
  * [Non-Uniform Memory Access (NUMA)](#non-uniform-memory-access-numa)
* [Software Configuration](#software-configuration)
  * [Channels Last](#channels-last)
  * [Numactl](#numactl)
  * [OpenMP](#openmp)
    * [OMP_NUM_THREADS](#omp-num-threads)
    * [OMP_THREAD_LIMIT](#omp-thread-limit)
    * [GNU OpenMP](#gnu-openmp)
    * [Intel OpenMP](#intel-openmp)
  * [Memory Allocator](#memory-allocator)
    * [Jemalloc](#jemalloc)
    * [TCMalloc](#tcmalloc)
  * [Denormal Number](#denormal-number)
  * [OneDNN primitive cache](#onednn-primitive-cache)

## Hardware Configuration

This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA).

### Intel CPU Structure

There are many families of Intel CPUs. We'll use the Intel® Xeon® processor Scalable family as an example to discuss how an Intel CPU works. This background is helpful for understanding the PyTorch optimization methodologies that Intel engineers recommend.

On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets (formerly Purley) platform, each chip provides up to 28 cores. Each core has a non-inclusive last-level cache and a 1MB L2 cache. The CPU features fast 2666 MHz DDR4 memory, six memory channels per CPU, Intel Ultra Path Interconnect (UPI) high-speed point-to-point processor interconnect, and more. Figure 1 shows the microarchitecture of the Intel® Xeon® processor Scalable family chips. Each CPU chip consists of a number of cores, along with core-specific caches. 6 channels of DDR4 memory are connected to the chip directly. Meanwhile, the chips communicate through the Intel UPI interconnect, which features a transfer speed of up to 10.4 GT/s.
+ +![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](../../../images/performance_tuning_guide/block_diagram_xeon_architecture.png) + +Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture. + +
+ +Usually, a CPU chip is called a socket. A typical two-socket configuration is illustrated in Figure 2. Two CPU sockets are installed on one motherboard. Each socket is connected to up to 6 channels of memory, called its local memory from the socket's perspective. Sockets are connected to each other via Intel UPI. It is possible for each socket to access memory attached to other sockets, which is usually called remote memory access. Local memory access is always faster than remote memory access. Meanwhile, cores on one socket share high-speed cache memory, which is much faster than communication via Intel UPI. Figure 3 shows an ASUS Z11PA-D8 Intel® Xeon® server motherboard, equipped with two sockets for Intel® Xeon® processor Scalable family CPUs. + +
+ +![Typical two-socket configuration](../../../images/performance_tuning_guide/two_socket_config.png) + +Figure 2: Typical two-socket configuration. + +![ASUS Z11PA-D8 Intel® Xeon® server motherboard](https://dlcdnimgs.asus.com/websites/global/products/MCCApMgGOdr9WJxN/MB-Z11PAD8-overview-01-s.jpg) + +Figure 3: An ASUS Z11PA-D8 Intel® Xeon® server motherboard. It contains two sockets for Intel® Xeon® processor Scalable family CPUs. + +
+ +### Non-Uniform Memory Access (NUMA) + +Providing more and more CPU cores in one socket is a good thing for users, because it brings more computation resources. However, it also brings memory access contention: a program can stall when the memory it needs is busy. To address this problem, Non-Uniform Memory Access (NUMA) was introduced. Compared to Uniform Memory Access (UMA), where all memory is connected to all cores equally, NUMA divides memory into multiple groups. A certain amount of memory is directly attached to one socket's integrated memory controller and becomes the local memory of that socket. As described in the previous section, local memory access is much faster than remote memory access. + +Users can get CPU information with the `lscpu` command on Linux to learn how many cores and sockets are on the machine. NUMA information, such as how CPU cores are distributed, can also be retrieved. The following is an example of `lscpu` execution on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. Two sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads, i.e. each socket has another 28 logical cores. Thus, there are 112 CPU cores in total. When indexing CPU cores, physical cores are usually indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), and the second 28 cores (28-55) are physical cores on the second NUMA socket (node). Logical cores are indexed afterward: 56-83 are the 28 logical cores on the first NUMA socket (node), and 84-111 are the 28 logical cores on the second NUMA socket (node). Typically, workloads running with Intel® Extension for PyTorch\* should avoid using logical cores to get good performance. + +``` +$ lscpu +... +CPU(s): 112 +On-line CPU(s) list: 0-111 +Thread(s) per core: 2 +Core(s) per socket: 28 +Socket(s): 2 +NUMA node(s): 2 +... +Model name: Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz +... +NUMA node0 CPU(s): 0-27,56-83 +NUMA node1 CPU(s): 28-55,84-111 +... +``` + +## Software Configuration + +This section introduces software configurations that help boost performance. + +### Channels Last + +Take advantage of the **Channels Last** memory format for image processing tasks. Compared to the PyTorch default NCHW (`torch.contiguous_format`) memory format, NHWC (`torch.channels_last`) is friendlier to Intel platforms, and thus generally yields better performance. A more detailed introduction can be found on the [Channels Last page](../features/nhwc.md). You can find sample code with ResNet50 on the [Example page](../examples.md). + +### Numactl + +Since NUMA largely influences memory access performance, this functionality should also be addressed on the software side. + +During the development of the Linux kernel, more and more sophisticated implementations, optimizations, and strategies have been introduced. Version 2.5 of the Linux kernel already contained basic NUMA support, which was further improved in subsequent kernel releases. Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases. Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages.
New sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.[1] The behavior of the Linux kernel thus differs according to kernel version. Newer Linux kernels may contain further optimizations of NUMA strategies, and thus deliver better performance. For some workloads, the NUMA strategy greatly influences performance. + +Linux provides a tool, `numactl`, that allows user control of NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores share a high-speed cache within one socket, so it is a good idea to avoid cross-socket computation. From a memory access perspective, binding memory access locally is much faster than accessing remote memory. + +The following is an example of numactl usage to run a workload on the Nth socket and limit memory access to its local memory on the Nth socket. A more detailed description of the numactl command can be found [on the numactl man page](https://linux.die.net/man/8/numactl). + +``` +numactl --cpunodebind N --membind N python +``` + + + +
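+
+The same NUMA-aware binding can also be expressed from Python with the runtime extension APIs documented elsewhere in this documentation (`intel_extension_for_pytorch.cpu.runtime`). The following is a minimal sketch, assuming node 0 is the socket you want to use; check the node IDs against your own `lscpu` output.
+
+```
+import torch
+import intel_extension_for_pytorch as ipex
+
+# Cores of NUMA node 0 (compare with the `lscpu` output above).
+cores_node0 = ipex.cpu.runtime.get_core_list_of_node_id(0)
+cpu_pool = ipex.cpu.runtime.CPUPool(core_ids=cores_node0)
+
+model = torch.nn.Linear(1024, 1024).eval()
+x = torch.randn(1, 1024)
+
+# Pin computation threads to the selected cores so that both compute and
+# memory access stay local to one socket.
+with ipex.cpu.runtime.pin(cpu_pool):
+    with torch.no_grad():
+        y = model(x)
+```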
+ + +
+ +
+
+
+
    +
  • + +
  • + View page source +
  • +
+
+
+
+
+ +
+

Intel® Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc

+

The design document has been merged with the ISA Dynamic Dispatch feature introduction.

+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/genindex.html b/cpu/2.4.0+cpu/genindex.html new file mode 100644 index 000000000..c8f4d981f --- /dev/null +++ b/cpu/2.4.0+cpu/genindex.html @@ -0,0 +1,398 @@ + + + + + + Index — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + +

Index

+ +
+ A + | C + | E + | F + | G + | I + | L + | M + | O + | P + | R + | T + | V + +
+

A

+ + +
+ +

C

+ + + +
+ +

E

+ + +
+ +

F

+ + + +
+ +

G

+ + + +
+ +

I

+ + + +
+ +

L

+ + + +
+ +

M

+ + + +
+ +

O

+ + +
+ +

P

+ + + +
+ +

R

+ + + +
+ +

T

+ + +
+ +

V

+ + + +
+ + + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/index.html b/cpu/2.4.0+cpu/index.html new file mode 100644 index 000000000..18ad39d70 --- /dev/null +++ b/cpu/2.4.0+cpu/index.html @@ -0,0 +1,203 @@ + + + + + + + + + Intel® Extension for PyTorch* — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Intel® Extension for PyTorch*

+

Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. +Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. +Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.

+
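A quick sketch to confirm what the installed stack detects on the host before relying on these instruction sets; the capability string comes from standard PyTorch introspection and is illustrative only:

```python
import torch
import intel_extension_for_pytorch as ipex

# Highest vector ISA level PyTorch detects on this CPU, e.g. "AVX2" or "AVX512".
print(torch.backends.cpu.get_cpu_capability())
# Installed Intel® Extension for PyTorch* version.
print(ipex.__version__)
```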

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain +LLMs are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the Large Language Models (LLM) section.

+
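A minimal sketch of how the LLM-specific frontend is typically applied to a Hugging Face causal language model (the model identifier below is a placeholder; see the Large Language Models (LLM) section for the supported models and the full argument list):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-causal-lm"  # placeholder identifier
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Apply the LLM-specific optimizations.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("An example prompt", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```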

The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing intel_extension_for_pytorch.

+
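For example, enabling the extension in an eager-mode inference script usually amounts to one import and one call to the optimize frontend; the sketch below uses a torchvision ResNet-50 purely for illustration:

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights=None).eval()
data = torch.rand(1, 3, 224, 224)

# Apply Intel® Extension for PyTorch* inference optimizations.
model = ipex.optimize(model, dtype=torch.float32)

with torch.no_grad():
    output = model(data)
```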
+

Note

+
    +
  • GPU features are not included in CPU-only packages.

  • +
  • Optimizations for CPU-only may have a newer code base due to different development schedules.

  • +
+
+

Intel® Extension for PyTorch* has been released as an open-source project on GitHub. You can find the source code and instructions on how to get started at:

+ +

You can find more information about the product at:

+ +
+

Architecture

+

Intel® Extension for PyTorch* is structured as shown in the following figure:

+
+Architecture of Intel® Extension for PyTorch* +
+

Architecture of Intel® Extension for PyTorch*

+
+
+
    +
  • Eager Mode: In the eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Further performance improvement is achieved by converting eager-mode models into graph mode using extended graph fusion passes.

  • +
  • Graph Mode: In the graph mode, fusions reduce operator/kernel invocation overhead, resulting in improved performance. Compared to the eager mode, the graph mode in PyTorch* normally yields better performance from optimization techniques like operation fusion. Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Both PyTorch TorchScript and TorchDynamo graph modes are supported. With TorchScript, we recommend using torch.jit.trace() as your preferred option, as it generally supports a wider range of workloads compared to torch.jit.script(). With TorchDynamo, the ipex backend is available to provide good performance (see the sketch after this list).

  • +
  • CPU Optimization: On CPU, Intel® Extension for PyTorch* automatically dispatches operators to underlying kernels based on detected instruction set architecture (ISA). The extension leverages vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for increased efficiency.

  • +
  • GPU Optimization: On GPU, optimized operators and kernels are implemented and registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix calculation features of Intel GPU hardware. Intel® Extension for PyTorch* for GPU utilizes the DPC++ compiler that supports the latest SYCL* standard and also a number of extensions to the SYCL* standard, which can be found in the sycl/doc/extensions directory.

  • +
+
+
+
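The sketch below illustrates the two graph-mode entry points mentioned in the list above, starting from an eager-mode torchvision model; it is illustrative only, and either path can be chosen per workload:

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights=None).eval()
example_input = torch.rand(1, 3, 224, 224)
model = ipex.optimize(model)

# Option 1: TorchScript via tracing (covers a wider range of workloads than scripting).
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)
    traced(example_input)

# Option 2: TorchDynamo with the ipex backend.
compiled = torch.compile(model, backend="ipex")
with torch.no_grad():
    compiled(example_input)
```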

Support

+

The team tracks bugs and enhancement requests using GitHub issues. Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported.

+
+
+
+
+
+
+
+
+
+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/objects.inv b/cpu/2.4.0+cpu/objects.inv new file mode 100644 index 000000000..544e9fdd1 Binary files /dev/null and b/cpu/2.4.0+cpu/objects.inv differ diff --git a/cpu/2.4.0+cpu/py-modindex.html b/cpu/2.4.0+cpu/py-modindex.html new file mode 100644 index 000000000..37a8cab7e --- /dev/null +++ b/cpu/2.4.0+cpu/py-modindex.html @@ -0,0 +1,186 @@ + + + + + + Python Module Index — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + +

Python Module Index

+ +
+ i +
+ + + + + + + + + + + + + + + + + + + + + + +
 
+ i
+ intel_extension_for_pytorch +
    + intel_extension_for_pytorch.cpu.runtime +
    + intel_extension_for_pytorch.llm +
    + intel_extension_for_pytorch.llm.functional +
    + intel_extension_for_pytorch.llm.modules +
    + intel_extension_for_pytorch.quantization +
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/search.html b/cpu/2.4.0+cpu/search.html new file mode 100644 index 000000000..b4e05678e --- /dev/null +++ b/cpu/2.4.0+cpu/search.html @@ -0,0 +1,168 @@ + + + + + + Search — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + + + +
+ +
+ +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + + + + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/searchindex.js b/cpu/2.4.0+cpu/searchindex.js new file mode 100644 index 000000000..011e3b704 --- /dev/null +++ b/cpu/2.4.0+cpu/searchindex.js @@ -0,0 +1 @@ +Search.setIndex({"alltitles": {"$\\alpha$ Usage": [[16, "alpha-usage"]], "1. Creating a serialized file": [[32, "creating-a-serialized-file"]], "1. Defining hyperparameters to tune:": [[14, "defining-hyperparameters-to-tune"]], "1.0.0-Alpha": [[34, "id45"]], "1.0.1-Alpha": [[34, "alpha"]], "1.0.2": [[34, "id44"]], "1.1.0": [[34, "id42"]], "1.10.0": [[34, "id32"]], "1.10.100": [[34, "id31"]], "1.11.0": [[34, "id29"]], "1.11.200": [[34, "id27"]], "1.12.0": [[34, "id24"]], "1.12.100": [[34, "id23"]], "1.12.300": [[34, "id21"]], "1.13.0": [[34, "id18"]], "1.13.100": [[34, "id16"]], "1.2.0": [[34, "id39"]], "1.8.0": [[34, "id37"]], "1.9.0": [[34, "id36"]], "2. Creating a Model Archive": [[32, "creating-a-model-archive"]], "2. Defining the search spaces of the hyperparameters:": [[14, "defining-the-search-spaces-of-the-hyperparameters"]], "2.0.0": [[34, "id14"]], "2.0.100": [[34, "id12"]], "2.1.0": [[34, "id10"]], "2.1.100": [[34, "id8"]], "2.2.0": [[34, "id6"]], "2.3.0": [[34, "id4"]], "2.3.100": [[34, "id2"]], "2.4.0": [[34, "id1"]], "3. Start TorchServe to serve the model": [[32, "start-torchserve-to-serve-the-model"]], "4. Registering and Deploying model": [[32, "registering-and-deploying-model"]], "": [[14, "your-python-script"]], "API Documentation": [[2, null], [25, "api-documentation"]], "Accuracy": [[30, "accuracy"]], "Add Custom Kernel": [[17, "add-custom-kernel"]], "Algorithm: Auto-tuning of $\\alpha$.": [[16, "algorithm-auto-tuning-of-alpha"]], "Already using Jit Trace": [[10, "already-using-jit-trace"]], "Already using ipex.optimize": [[10, "already-using-ipex-optimize"]], "Architecture": [[1, "architecture"]], "Auto Channels Last": [[7, "auto-channels-last"], [9, null]], "Auto Mixed Precision (AMP)": [[7, "auto-mixed-precision-amp"], [8, null]], "Autocast Op Reference": [[8, "autocast-op-reference"]], "BERT": [[6, "bert"], [6, "id2"], [6, "id4"], [6, "id7"], [6, "id10"], [6, "id13"], [32, "bert"]], "BFloat16": [[6, "bfloat16"], [21, "bfloat16"], [26, "bfloat16"]], "Benchmarking with Launcher": [[32, "benchmarking-with-launcher"]], "Benchmarking with Launcher Core Pinning": [[32, "benchmarking-with-launcher-core-pinning"]], "Better local unit tests with pytest": [[5, "better-local-unit-tests-with-pytest"]], "Blogs & Publications": [[3, null]], "Building documentation": [[5, "building-documentation"]], "C++": [[6, "c"]], "C++ Unit Testing": [[5, "c-unit-testing"]], "CPU Channels Last Targets": [[18, "cpu-channels-last-targets"]], "CPU ISA build compiler requirement": [[17, "cpu-isa-build-compiler-requirement"]], "CPU Runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime"]], "CPU feature check": [[17, "cpu-feature-check"]], "Calibration": [[6, "calibration"]], "Channels Last": [[18, null], [33, "channels-last"]], "Cheat Sheet": [[4, null]], "Code Folder Struct": [[17, "code-folder-struct"]], "CodeGen Process": [[17, "codegen-process"]], "Codeless Optimization (Prototype)": [[10, null]], "Codeless Optimization (Prototype, NEW feature from 1.13.0)": [[7, "codeless-optimization-prototype-new-feature-from-1-13-0"]], "Command to apply ipex optimization for BF16": [[10, "command-to-apply-ipex-optimization-for-bf16"]], "Command to apply ipex optimization for FP32": [[10, "command-to-apply-ipex-optimization-for-fp32"]], "Configuration": [[30, 
"configuration"], [30, "id2"], [30, "id5"]], "Contents of this Document": [[32, "contents-of-this-document"], [33, "contents-of-this-document"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[5, "contributing-to-intel-extension-for-pytorch"]], "Contribution": [[5, null]], "Convert to Dynamic Quantized Model and Deploy": [[15, "convert-to-dynamic-quantized-model-and-deploy"]], "Convert to Static Quantized Model and Deploy": [[15, "convert-to-static-quantized-model-and-deploy"]], "Creating and Exporting INT8 model for Intel\u00ae Extension for PyTorch*": [[32, "creating-and-exporting-int8-model-for-intel-extension-for-pytorch"]], "Default Precision": [[8, "default-precision"]], "Default memory allocator": [[31, "default-memory-allocator"]], "Default search space": [[14, "default-search-space"]], "Define QConfig": [[15, "id1"]], "Define qconfig": [[15, "define-qconfig"]], "Defining hyperparameters and their search spaces": [[14, "defining-hyperparameters-and-their-search-spaces"]], "Demos": [[28, "demos"]], "Denormal Number": [[33, "denormal-number"]], "Deployment": [[6, "deployment"]], "Design of Task": [[20, "design-of-task"]], "Detail Design": [[20, "detail-design"]], "Determining the alpha through auto-tuning": [[16, "determining-the-alpha-through-auto-tuning"]], "Developing Intel\u00ae Extension for PyTorch*": [[5, "developing-intel-extension-for-pytorch"]], "Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h": [[17, "dispatch-stub-implementation-csrc-cpu-dyndisp-dispatchstub-cpp-and-csrc-cpu-dyndisp-dispatchstub-h"]], "Distributed Inference": [[28, "distributed-inference"]], "Distributed Inference with DeepSpeed": [[29, "distributed-inference-with-deepspeed"]], "Distributed Training": [[6, "distributed-training"]], "Dynamic Dispatch Design": [[17, "dynamic-dispatch-design"]], "Dynamic Quantization": [[6, "dynamic-quantization"], [15, "dynamic-quantization"]], "Dynamic Shape": [[26, "dynamic-shape"]], "Eager Mode": [[6, "eager-mode"], [6, "id5"]], "Ease-of-use auto channels last API": [[9, "ease-of-use-auto-channels-last-api"]], "Ease-of-use graph optimization API": [[13, "ease-of-use-graph-optimization-api"]], "Easy-to-use Python API": [[7, "easy-to-use-python-api"]], "Example Usage with HuggingFace": [[10, "example-usage-with-huggingface"]], "Example of MultiStream Module": [[20, "example-of-multistream-module"]], "Example of asynchronous task": [[20, "example-of-asynchronous-task"]], "Example of configuring core binding": [[20, "example-of-configuring-core-binding"]], "Example:": [[17, "example"], [17, "id1"]], "Examples": [[6, null]], "Examples1: Basic Usage": [[20, "examples1-basic-usage"]], "Examples2: Usage with \u201cAUTO\u201d setting": [[20, "examples2-usage-with-auto-setting"]], "Examples3: Usage for models with structure inputs/outputs": [[20, "examples3-usage-for-models-with-structure-inputs-outputs"]], "FP32 and BF16 fusion patterns": [[13, "fp32-and-bf16-fusion-patterns"]], "FP32 and BF16 models": [[13, "fp32-and-bf16-models"]], "FP32 and BFloat16 with v1.10": [[30, "fp32-and-bfloat16-with-v1-10"]], "FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance": [[30, "fp32-with-v1-11-200-on-an-aws-ec2-c6i-2xlarge-instance"]], "FP32/BF16": [[6, "fp32-bf16"], [29, "fp32-bf16"]], "Fast BERT (Prototype)": [[11, null]], "Fast BERT Optimization (Prototype, NEW feature from 2.0.0)": [[7, "fast-bert-optimization-prototype-new-feature-from-2-0-0"]], "Fast Bert (Prototype)": [[2, "fast-bert-prototype"], [6, 
"fast-bert-prototype"]], "Feature Description": [[11, "feature-description"], [12, "feature-description"]], "Features": [[7, null]], "Float32": [[6, "float32"]], "Folding": [[13, "folding"]], "Fusion": [[13, "fusion"]], "GNU OpenMP": [[33, "gnu-openmp"]], "GNU OpenMP Library": [[31, "gnu-openmp-library"]], "General": [[2, "general"]], "General Usage": [[26, "general-usage"]], "Get Started": [[25, "get-started"]], "Graph Capture (Prototype)": [[12, null]], "Graph Capture (Prototype, NEW feature from 1.13.0)": [[7, "graph-capture-prototype-new-feature-from-1-13-0"]], "Graph Optimization": [[2, "graph-optimization"], [7, "graph-optimization"], [13, null], [28, "graph-optimization"]], "Hardware Configuration": [[30, "hardware-configuration"], [30, "id7"], [33, "hardware-configuration"]], "Highlights": [[34, "highlights"], [34, "id3"], [34, "id5"], [34, "id7"], [34, "id9"], [34, "id11"], [34, "id13"], [34, "id15"], [34, "id17"], [34, "id19"], [34, "id22"], [34, "id25"], [34, "id28"], [34, "id30"], [34, "id33"]], "How the core binding is implemented": [[20, "how-the-core-binding-is-implemented"]], "HyperTune (Prototype)": [[14, null]], "HyperTune (Prototype, NEW feature from 1.13.0)": [[7, "hypertune-prototype-new-feature-from-1-13-0"]], "Hyperparameters": [[14, "hyperparameters"]], "I. Use all physical cores": [[31, "i-use-all-physical-cores"]], "II. Use all cores including logical cores": [[31, "ii-use-all-cores-including-logical-cores"]], "III. Use physical cores on designated nodes": [[31, "iii-use-physical-cores-on-designated-nodes"]], "INT8": [[6, "int8"], [26, "int8"]], "INT8 Quantization": [[7, "int8-quantization"]], "INT8 Recipe Tuning API (Prototype)": [[16, null]], "INT8 fusion patterns": [[13, "int8-fusion-patterns"]], "INT8 models": [[13, "int8-models"]], "INT8 with v1.11": [[30, "int8-with-v1-11"]], "IOMP preload or load during the runtime": [[20, "iomp-preload-or-load-during-the-runtime"]], "ISA Dynamic Dispatching": [[7, "isa-dynamic-dispatching"], [17, null]], "ISA intrinics specific kernel example:": [[17, "isa-intrinics-specific-kernel-example"]], "IV. 
Use your designated number of cores": [[31, "iv-use-your-designated-number-of-cores"]], "Indirect Access KV Cache": [[28, "indirect-access-kv-cache"]], "Inference": [[6, "inference"]], "Inference with Eager Path": [[8, "inference-with-eager-path"]], "Inference with TorchScript Path": [[8, "inference-with-torchscript-path"]], "Install Intel\u00ae Extension for PyTorch*": [[32, "install-intel-extension-for-pytorch"]], "Installation": [[24, null]], "Intel CPU Structure": [[33, "intel-cpu-structure"]], "Intel OpenMP": [[33, "intel-openmp"]], "Intel OpenMP Library": [[31, "intel-openmp-library"]], "Intel\u00ae Extension for PyTorch*": [[1, null]], "Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc": [[0, null]], "Intel\u00ae Extension for PyTorch* optimizations for quantization": [[15, null]], "Introduction": [[8, "introduction"], [19, "introduction"], [25, null]], "Jemalloc": [[31, "jemalloc"], [33, "jemalloc"]], "Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h": [[17, "kernel-stub-csrc-cpu-aten-xyz-cpp-and-csrc-cpu-aten-xyz-h"]], "Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp": [[17, "kernel-implementation-csrc-cpu-aten-kernels-xyzkrnl-cpp"]], "Known Issues": [[34, "known-issues"], [34, "id20"], [34, "id26"], [34, "id34"]], "Known issue": [[9, "known-issue"], [34, "known-issue"], [34, "id47"]], "Known issues": [[20, "known-issues"], [34, "id41"]], "LLM Module Level Optimizations (Prototype)": [[2, "llm-module-level-optimizations-prototype"]], "LLM Optimizations Frontend API": [[29, null]], "LLM Performance": [[30, "llm-performance"]], "LLM Quick Start": [[23, "llm-quick-start"]], "Large Language Model (LLM)": [[6, "large-language-model-llm"]], "Large Language Models (LLM) Optimization Overview": [[28, null]], "Large Language Models (LLM, NEW feature from 2.1.0)": [[7, "large-language-models-llm-new-feature-from-2-1-0"]], "Launch Script Usage Guide": [[31, null]], "Launcher Core Pinning to Boost Performance of TorchServe Multi Worker Inference": [[32, "launcher-core-pinning-to-boost-performance-of-torchserve-multi-worker-inference"]], "Launcher Hyperparameters": [[14, "launcher-hyperparameters"]], "License": [[27, null]], "Linear Operator Optimization": [[28, "linear-operator-optimization"]], "Local linting": [[5, "local-linting"]], "Low Precision Data Types": [[28, "low-precision-data-types"]], "Memory Allocator": [[33, "memory-allocator"]], "Memory Format Is All That Matters": [[18, "memory-format-is-all-that-matters"]], "Methodology": [[13, "methodology"]], "Module Level Optimization API for customized LLM (Prototype)": [[28, "module-level-optimization-api-for-customized-llm-prototype"]], "Module uses forward method explicitly instead of the __call__ attr": [[10, "module-uses-forward-method-explicitly-instead-of-the-call-attr"]], "Motivation": [[10, "motivation"]], "Multiple instances for inference": [[31, "multiple-instances-for-inference"]], "NOTE": [[34, "note"]], "Non-Uniform Memory Access (NUMA)": [[33, "non-uniform-memory-access-numa"]], "Numactl": [[33, "numactl"]], "OMP_NUM_THREADS": [[33, "omp-num-threads"]], "OMP_THREAD_LIMIT": [[33, "omp-thread-limit"]], "OneDNN primitive cache": [[33, "onednn-primitive-cache"]], "Op Eligibility": [[8, "op-eligibility"]], "Op-Specific Behavior": [[8, "op-specific-behavior"]], "OpenMP": [[33, "openmp"]], "Operation Fusion": [[19, "operation-fusion"]], "Operator Optimization": [[7, "operator-optimization"]], "Ops that can autocast to bfloat16": [[8, "ops-that-can-autocast-to-bfloat16"]], "Ops that can 
autocast to float32": [[8, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[8, "ops-that-promote-to-the-widest-input-type"]], "Optimization Methodologies": [[28, "optimization-methodologies"]], "Optimizer Fusion": [[19, null]], "Optimizer Optimization": [[7, "optimizer-optimization"]], "Others": [[34, "others"]], "Overview": [[17, "overview"], [30, "overview"], [31, "overview"], [33, "overview"]], "Performance": [[30, null], [34, "performance"]], "Performance Boost with Intel\u00ae Extension for PyTorch* and Launcher": [[32, "performance-boost-with-intel-extension-for-pytorch-and-launcher"]], "Performance Data for Intel\u00ae AI Data Center Products": [[30, "performance-data-for-intel-ai-data-center-products"]], "Performance Improvement": [[34, "performance-improvement"]], "Performance Numbers": [[30, "performance-numbers"], [30, "id1"], [30, "id4"]], "Performance Regression": [[26, "performance-regression"]], "Performance Result": [[34, "performance-result"]], "Performance Tuning Guide": [[33, null]], "Performance recipes": [[20, "performance-recipes"]], "Prepare Model": [[15, "prepare-model"]], "Prepare Model and Do Calibration": [[15, "prepare-model-and-do-calibration"]], "Prerequisite": [[11, "prerequisite"]], "Private Debug APIs": [[17, "private-debug-apis"]], "Pseudocode of Common Usage Scenarios": [[29, "pseudocode-of-common-usage-scenarios"]], "PyTorch Channels Last Memory Format APIs": [[18, "pytorch-channels-last-memory-format-apis"]], "PyTorch Strided Layout": [[18, "pytorch-strided-layout"]], "Python": [[6, "python"]], "Python Unit Testing": [[5, "python-unit-testing"]], "Quantization": [[2, "module-intel_extension_for_pytorch.quantization"]], "Quick Start": [[23, null]], "Releases": [[34, null]], "Requirements": [[20, "requirements"]], "ResNet50": [[32, "resnet50"]], "Resnet50": [[6, "resnet50"], [6, "id1"], [6, "id3"], [6, "id6"], [6, "id9"], [6, "id12"]], "Result Correctness": [[26, "result-correctness"]], "Runtime Extension": [[7, "runtime-extension"], [20, null], [26, "runtime-extension"]], "Scaling workers": [[32, "scaling-workers"]], "Select ISA level manually.": [[17, "select-isa-level-manually"]], "Serving model with Intel\u00ae Extension for PyTorch*": [[32, "serving-model-with-intel-extension-for-pytorch"]], "Single instance for inference": [[31, "single-instance-for-inference"]], "Smooth Quant Recipe Tuning API (Prototype)": [[22, null]], "Smooth Quantization Autotune": [[16, "smooth-quantization-autotune"]], "Smooth Quantization INT8": [[6, "smooth-quantization-int8"]], "SmoothQuant": [[29, "smoothquant"]], "Software Configuration": [[33, "software-configuration"]], "Software Version": [[30, "software-version"], [30, "id3"], [30, "id6"]], "Split SGD": [[21, null], [21, "id2"]], "Static Quantization": [[6, "static-quantization"], [15, "static-quantization"]], "Stochastic Gradient Descent (SGD)": [[21, "stochastic-gradient-descent-sgd"]], "Support": [[1, "support"]], "TCMalloc": [[31, "tcmalloc"], [33, "tcmalloc"]], "The origin command with ipex launch": [[10, "the-origin-command-with-ipex-launch"]], "Tips": [[5, "tips"]], "Tips and Debugging": [[5, "tips-and-debugging"]], "TorchDynamo": [[26, "torchdynamo"]], "TorchDynamo Mode (Beta, NEW feature from 2.0.0)": [[6, "torchdynamo-mode-beta-new-feature-from-2-0-0"], [6, "id11"]], "TorchScript Mode": [[6, "torchscript-mode"], [6, "id8"]], "TorchServe with Intel\u00ae Extension for PyTorch*": [[32, null]], "TorchServe with Launcher": [[32, "torchserve-with-launcher"]], "Training": [[6, 
"training"]], "Training Support": [[8, "training-support"]], "Troubleshooting": [[26, null]], "Unit testing": [[5, "unit-testing"]], "Usage Example": [[11, "usage-example"], [12, "usage-example"], [16, "usage-example"]], "Usage Examples": [[14, "usage-examples"], [31, "usage-examples"]], "Usage of Hypertune": [[14, "usage-of-hypertune"]], "Usage of Jemalloc/TCMalloc/Default memory allocator": [[31, "usage-of-jemalloc-tcmalloc-default-memory-allocator"]], "Usage of OpenMP library": [[31, "usage-of-openmp-library"]], "Usage of launch script": [[31, "usage-of-launch-script"]], "Use Case": [[8, "use-case"]], "Use Case not supported": [[10, "use-case-not-supported"]], "Use Cases": [[20, "use-cases"]], "User defined search space": [[14, "user-defined-search-space"]], "Using a fixed alpha": [[16, "using-a-fixed-alpha"]], "V. Throughput mode": [[31, "v-throughput-mode"]], "VI. Latency mode": [[31, "vi-latency-mode"]], "VII. Your designated number of instances": [[31, "vii-your-designated-number-of-instances"]], "VIII. Your designated number of instances and instance index": [[31, "viii-your-designated-number-of-instances-and-instance-index"]], "Vec specific kernel example:": [[17, "vec-specific-kernel-example"]], "Verified for distributed inference mode via DeepSpeed": [[28, "verified-for-distributed-inference-mode-via-deepspeed"]], "Verified for single instance mode": [[28, "verified-for-single-instance-mode"]], "Weight Only Quantization (WOQ)": [[29, "weight-only-quantization-woq"]], "Weight Only Quantization INT8/INT4": [[6, "weight-only-quantization-int8-int4"]], "What is Channels Last": [[18, "what-is-channels-last"]], "What\u2019s Changed": [[34, "what-s-changed"], [34, "id35"]], "What\u2019s New": [[34, "what-s-new"], [34, "id38"], [34, "id40"], [34, "id43"], [34, "id46"]], "Writing Channels Last Kernels": [[18, "writing-channels-last-kernels"]], "Writing documentation": [[5, "writing-documentation"]], "a. Create NHWC Memory": [[18, "a-create-nhwc-memory"]], "a. NCHW (default)": [[18, "a-nchw-default"]], "a. Status on CPU": [[18, "a-status-on-cpu"]], "a. tensor creation": [[18, "a-tensor-creation"]], "b. Create Convolution Primitive": [[18, "b-create-convolution-primitive"]], "b. NHWC (WIP for CPU)": [[18, "b-nhwc-wip-for-cpu"]], "b. Register Channels Last Kernel in ATen Native Manner": [[18, "b-register-channels-last-kernel-in-aten-native-manner"]], "b. tensor conversion": [[18, "b-tensor-conversion"]], "c. Blocked (nChw16c)": [[18, "c-blocked-nchw16c"]], "c. Register oneDNN Kernel on Channels Last": [[18, "c-register-onednn-kernel-on-channels-last"]], "c. model conversion": [[18, "c-model-conversion"]], "d. 
operator coverage": [[18, "d-operator-coverage"]], "default": [[9, "default"]], "disable": [[9, "disable"]], "enable": [[9, "enable"]], "ipex.llm Optimized Model List for Inference": [[28, "ipex-llm-optimized-model-list-for-inference"]], "oneDNN NHWC APIs": [[18, "onednn-nhwc-apis"]], "torch.compile (Beta, NEW feature from 2.0.0)": [[7, "torch-compile-beta-new-feature-from-2-0-0"]], "your_conf_file": [[14, "your-conf-file"]]}, "docnames": ["design_doc/cpu/isa_dyndisp", "index", "tutorials/api_doc", "tutorials/blogs_publications", "tutorials/cheat_sheet", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/amp", "tutorials/features/auto_channels_last", "tutorials/features/codeless_optimization", "tutorials/features/fast_bert", "tutorials/features/graph_capture", "tutorials/features/graph_optimization", "tutorials/features/hypertune", "tutorials/features/int8_overview", "tutorials/features/int8_recipe_tuning_api", "tutorials/features/isa_dynamic_dispatch", "tutorials/features/nhwc", "tutorials/features/optimizer_fusion", "tutorials/features/runtime_extension", "tutorials/features/split_sgd", "tutorials/features/sq_recipe_tuning_api", "tutorials/getting_started", "tutorials/installation", "tutorials/introduction", "tutorials/known_issues", "tutorials/license", "tutorials/llm", "tutorials/llm/llm_optimize", "tutorials/performance", "tutorials/performance_tuning/launch_script", "tutorials/performance_tuning/torchserve", "tutorials/performance_tuning/tuning_guide", "tutorials/releases"], "envversion": {"sphinx": 62, "sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2}, "filenames": ["design_doc/cpu/isa_dyndisp.md", "index.rst", "tutorials/api_doc.rst", "tutorials/blogs_publications.md", "tutorials/cheat_sheet.md", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", "tutorials/features/amp.md", "tutorials/features/auto_channels_last.md", "tutorials/features/codeless_optimization.md", "tutorials/features/fast_bert.md", "tutorials/features/graph_capture.md", "tutorials/features/graph_optimization.md", "tutorials/features/hypertune.md", "tutorials/features/int8_overview.md", "tutorials/features/int8_recipe_tuning_api.md", "tutorials/features/isa_dynamic_dispatch.md", "tutorials/features/nhwc.md", "tutorials/features/optimizer_fusion.md", "tutorials/features/runtime_extension.md", "tutorials/features/split_sgd.rst", "tutorials/features/sq_recipe_tuning_api.md", "tutorials/getting_started.md", "tutorials/installation.md", "tutorials/introduction.rst", "tutorials/known_issues.md", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/llm_optimize.md", "tutorials/performance.md", "tutorials/performance_tuning/launch_script.md", "tutorials/performance_tuning/torchserve.md", "tutorials/performance_tuning/tuning_guide.md", "tutorials/releases.md"], "indexentries": {"autotune() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.autotune", false]], "convert() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.convert", false]], "cpupool (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.CPUPool", false]], "enable_onednn_fusion() (in module intel_extension_for_pytorch)": [[2, 
"intel_extension_for_pytorch.enable_onednn_fusion", false]], "fast_bert() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.fast_bert", false]], "fast_layer_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.fast_layer_norm", false]], "fastlayernorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.FastLayerNorm", false]], "frozenbatchnorm2d (class in intel_extension_for_pytorch.nn)": [[7, "intel_extension_for_pytorch.nn.FrozenBatchNorm2d", false]], "get_core_list_of_node_id() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.get_core_list_of_node_id", false]], "get_smooth_quant_qconfig_mapping() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.get_smooth_quant_qconfig_mapping", false]], "get_weight_only_quant_qconfig_mapping() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.get_weight_only_quant_qconfig_mapping", false]], "indirect_access_kv_cache_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.indirect_access_kv_cache_attention", false]], "indirectaccesskvcacheattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.IndirectAccessKVCacheAttention", false]], "intel_extension_for_pytorch": [[2, "module-intel_extension_for_pytorch", false]], "intel_extension_for_pytorch.cpu.runtime": [[2, "module-intel_extension_for_pytorch.cpu.runtime", false]], "intel_extension_for_pytorch.llm": [[2, "module-intel_extension_for_pytorch.llm", false]], "intel_extension_for_pytorch.llm.functional": [[2, "module-intel_extension_for_pytorch.llm.functional", false]], "intel_extension_for_pytorch.llm.modules": [[2, "module-intel_extension_for_pytorch.llm.modules", false]], "intel_extension_for_pytorch.quantization": [[2, "module-intel_extension_for_pytorch.quantization", false]], "interaction() (in module intel_extension_for_pytorch.nn.functional)": [[7, "intel_extension_for_pytorch.nn.functional.interaction", false]], "is_runtime_ext_enabled() (in module intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.is_runtime_ext_enabled", false]], "linear2silumul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.Linear2SiluMul", false]], "linearadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAdd", false]], "linearaddadd (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearAddAdd", false]], "lineargelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearGelu", false]], "linearmul (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearMul", false]], "linearnewgelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearNewGelu", false]], "linearrelu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearRelu", false]], "linearsilu (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.LinearSilu", false]], "linearsilumul (class in intel_extension_for_pytorch.llm.modules)": [[2, 
"intel_extension_for_pytorch.llm.modules.LinearSiluMul", false]], "mergedembeddingbag (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBag", false]], "mergedembeddingbagwithsgd (class in intel_extension_for_pytorch.nn.modules)": [[7, "intel_extension_for_pytorch.nn.modules.MergedEmbeddingBagWithSGD", false]], "module": [[2, "module-intel_extension_for_pytorch", false], [2, "module-intel_extension_for_pytorch.cpu.runtime", false], [2, "module-intel_extension_for_pytorch.llm", false], [2, "module-intel_extension_for_pytorch.llm.functional", false], [2, "module-intel_extension_for_pytorch.llm.modules", false], [2, "module-intel_extension_for_pytorch.quantization", false]], "multistreammodule (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModule", false]], "multistreammodulehint (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.MultiStreamModuleHint", false]], "optimize() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.optimize", false]], "optimize() (in module intel_extension_for_pytorch.llm)": [[2, "intel_extension_for_pytorch.llm.optimize", false]], "pagedattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.PagedAttention", false]], "pin (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.pin", false]], "prepare() (in module intel_extension_for_pytorch.quantization)": [[2, "intel_extension_for_pytorch.quantization.prepare", false]], "rms_norm() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rms_norm", false]], "rmsnorm (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RMSNorm", false]], "rotary_embedding() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.rotary_embedding", false]], "rotaryembedding (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.RotaryEmbedding", false]], "task (class in intel_extension_for_pytorch.cpu.runtime)": [[2, "intel_extension_for_pytorch.cpu.runtime.Task", false]], "varlen_attention() (in module intel_extension_for_pytorch.llm.functional)": [[2, "intel_extension_for_pytorch.llm.functional.varlen_attention", false]], "varlenattention (class in intel_extension_for_pytorch.llm.modules)": [[2, "intel_extension_for_pytorch.llm.modules.VarlenAttention", false]], "verbose (class in intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.verbose", false]]}, "objects": {"": [[2, 0, 0, "-", "intel_extension_for_pytorch"]], "intel_extension_for_pytorch": [[2, 2, 1, "", "enable_onednn_fusion"], [2, 2, 1, "", "fast_bert"], [2, 0, 0, "-", "llm"], [2, 2, 1, "", "optimize"], [2, 0, 0, "-", "quantization"], [2, 1, 1, "", "verbose"]], "intel_extension_for_pytorch.cpu": [[2, 0, 0, "-", "runtime"]], "intel_extension_for_pytorch.cpu.runtime": [[2, 1, 1, "", "CPUPool"], [2, 1, 1, "", "MultiStreamModule"], [2, 1, 1, "", "MultiStreamModuleHint"], [2, 1, 1, "", "Task"], [2, 2, 1, "", "get_core_list_of_node_id"], [2, 2, 1, "", "is_runtime_ext_enabled"], [2, 1, 1, "", "pin"]], "intel_extension_for_pytorch.llm": [[2, 0, 0, "-", "functional"], [2, 0, 0, "-", "modules"], [2, 2, 1, "", "optimize"]], "intel_extension_for_pytorch.llm.functional": [[2, 2, 1, "", 
"fast_layer_norm"], [2, 2, 1, "", "indirect_access_kv_cache_attention"], [2, 2, 1, "", "rms_norm"], [2, 2, 1, "", "rotary_embedding"], [2, 2, 1, "", "varlen_attention"]], "intel_extension_for_pytorch.llm.modules": [[2, 1, 1, "", "FastLayerNorm"], [2, 1, 1, "", "IndirectAccessKVCacheAttention"], [2, 1, 1, "", "Linear2SiluMul"], [2, 1, 1, "", "LinearAdd"], [2, 1, 1, "", "LinearAddAdd"], [2, 1, 1, "", "LinearGelu"], [2, 1, 1, "", "LinearMul"], [2, 1, 1, "", "LinearNewGelu"], [2, 1, 1, "", "LinearRelu"], [2, 1, 1, "", "LinearSilu"], [2, 1, 1, "", "LinearSiluMul"], [2, 1, 1, "", "PagedAttention"], [2, 1, 1, "", "RMSNorm"], [2, 1, 1, "", "RotaryEmbedding"], [2, 1, 1, "", "VarlenAttention"]], "intel_extension_for_pytorch.nn": [[7, 1, 1, "", "FrozenBatchNorm2d"]], "intel_extension_for_pytorch.nn.functional": [[7, 2, 1, "", "interaction"]], "intel_extension_for_pytorch.nn.modules": [[7, 1, 1, "", "MergedEmbeddingBag"], [7, 1, 1, "", "MergedEmbeddingBagWithSGD"]], "intel_extension_for_pytorch.quantization": [[2, 2, 1, "", "autotune"], [2, 2, 1, "", "convert"], [2, 2, 1, "", "get_smooth_quant_qconfig_mapping"], [2, 2, 1, "", "get_weight_only_quant_qconfig_mapping"], [2, 2, 1, "", "prepare"]]}, "objnames": {"0": ["py", "module", "Python module"], "1": ["py", "class", "Python class"], "2": ["py", "function", "Python function"]}, "objtypes": {"0": "py:module", "1": "py:class", "2": "py:function"}, "terms": {"": [2, 3, 5, 8, 10, 14, 15, 18, 19, 20, 21, 22, 26, 31, 32, 33], "0": [1, 2, 4, 5, 8, 10, 11, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 27, 30, 31, 32, 33], "00": [31, 34], "00000": 21, "00000000000602e7": 17, "0000012345": 21, "001": [6, 8], "0016": 30, "01": [2, 4, 7, 16, 31, 32, 34], "02": [30, 32], "02x": 30, "03": 32, "03x": 30, "04": [30, 31], "04x": 30, "05": [2, 7, 10, 30, 31], "05x": 30, "06": [2, 31, 32], "06x": 30, "07": 31, "07x": 30, "08": 31, "08x": 30, "09": [17, 31], "096": 32, "09864": 2, "09x": 30, "0x00007f3cde954000": 6, "0x00007f3ce16ac000": 6, "0x00007f3cf70fc000": 6, "0x00007f3cf985a000": 6, "0x00007f3cf98e0000": 6, "0x1": 17, "0x700001c": 30, "0x7fff": 17, "0xd0002a0": 30, "0xffff": 17, "1": [1, 2, 3, 4, 6, 8, 10, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 33], "10": [7, 14, 16, 17, 18, 21, 25, 26, 31, 32, 33], "100": [2, 4, 14, 16, 17, 30, 32], "10000": 2, "1009": 30, "100mb": 34, "1024": [30, 33], "102b": 28, "1032": 34, "10438": 2, "1053": 34, "1074": 34, "10k": 6, "10x": 30, "11": [17, 31, 32], "111": 33, "112": [26, 30, 33, 34], "117": 31, "118": 31, "11b": [28, 34], "11x": 30, "12": [6, 10, 14, 17, 30, 31, 32], "1200": 30, "12345": 21, "1234500000": 21, "1234512345": 21, "125m": 6, "127": [6, 31, 34], "128": [6, 8, 10, 13, 20, 30, 34], "128k": [2, 28, 34], "128task": 30, "1295": 34, "12b": 28, "12x": 30, "13": [3, 10, 17, 30, 31, 32, 33], "1318": 34, "1322": 34, "1328": 34, "1330": 34, "1338": 34, "1341": 34, "1353": 34, "1355": 34, "1367": 34, "1373": 34, "1376": 34, "1384": 34, "1391": 34, "1392": 34, "13b": [28, 30, 34], "13x": 30, "14": [31, 32, 34], "140": 31, "1414": 34, "1419": 34, "143": 31, "146": 31, "1473": 34, "1488": 34, "149": 31, "14x": 30, "15": [14, 17, 30, 31, 32], "151": 31, "1513": 34, "1517": 34, "154": 31, "1563": 34, "1564": 34, "1566": 34, "1568": 34, "157": 31, "1580": 34, "1585": 34, "1587": 34, "1589": 34, "159": 31, "1590": 34, "1592": 34, "1593": 34, "1594": 34, "15x": 30, "16": [2, 17, 20, 21, 30, 31, 32], "160": 30, "162": 31, "164": 31, "1664": 34, "167": 31, "1677": 34, "1682": 34, "1688": 34, "1695": 34, 
"16gb": 30, "16x": 30, "16xlarg": 30, "17": [6, 30, 31, 32], "170": 30, "175": 31, "176": 31, "177": 31, "17th": 30, "18": [30, 31, 32], "18x": 30, "19": [7, 30, 31, 32, 34], "199": 30, "19x": 30, "1_6b": 28, "1b7": 28, "1d": 18, "1e": [2, 7, 10, 16], "1mb": 33, "2": [1, 2, 3, 8, 10, 11, 16, 17, 18, 20, 21, 25, 26, 27, 28, 29, 30, 31, 33], "20": [2, 7, 18, 30, 31, 32, 34], "2006080250": 30, "200m": 33, "2017": 3, "2019": 3, "2020": 3, "2021": [3, 17, 31, 32], "2022": [3, 31, 32], "2023": [2, 3, 30], "2048": [2, 6], "205": 34, "20b": 28, "20x": 30, "21": [30, 31, 32], "2104": 2, "2105": 30, "2137": 34, "2195": 34, "2198": 34, "21x": 30, "22": [6, 30, 31, 32], "220m": 34, "220mb": 34, "2211": 2, "2229": 34, "223": 32, "2236": 34, "224": [6, 8, 10, 12, 13, 30, 32, 34], "224m": 34, "2251": 34, "2253": 34, "2257": 34, "2264": 34, "2275": 34, "2278": 34, "2280": 34, "2283": 34, "2290": 34, "2292": 34, "2299": 34, "23": [21, 31, 32], "2315": 34, "2317": 34, "2319": 34, "233": 31, "2334": 34, "2349": 34, "235": 31, "236": 31, "2392": 34, "24": [31, 32], "2412": 34, "2433": 34, "244": 13, "2468": 34, "2469": 34, "2473": 34, "2476": 34, "2480": 34, "2491": 34, "24x": 30, "24xlarg": 32, "25": [31, 32], "2511": 34, "2550": 34, "256": [2, 30], "2561": 34, "2568": 34, "256gb": 30, "2584": 34, "26": [30, 31, 32], "2613": 34, "2617": 34, "2627": 34, "2631": 34, "2641": 34, "2663": 34, "2666": 33, "2675": 34, "26x": 30, "27": [31, 32, 33], "2704": 34, "2733": 34, "274": 32, "2747": 34, "278": 34, "27x": 30, "28": [10, 14, 16, 30, 31, 32, 33, 34], "2883": 34, "29": [7, 31, 32], "2910": 34, "2911": 34, "2928": 34, "29500": [6, 31], "2985": 34, "2987": 34, "29x": 30, "2b": 28, "2d": 18, "2nd": 28, "2x": 34, "3": [2, 5, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 20, 21, 28, 30, 31, 33], "30": [31, 32], "3030": 34, "305": 30, "3079": 34, "3080": 34, "30b": 28, "30ghz": 30, "30x": 30, "31": [31, 32], "3116": 34, "3143": 34, "31x": 30, "32": [2, 6, 18, 21, 23, 30, 31, 32], "3200": 30, "32x": 30, "32x16d": 30, "33": [17, 31, 32], "339081764221191": 14, "33x": 30, "34": [31, 32], "35": [31, 32], "355": 31, "356": 31, "35x": 30, "36": [30, 31, 32], "36x": 30, "37": [31, 32, 34], "38": [31, 32], "384": [10, 32, 34], "384task": 30, "38x": 30, "39": [30, 31, 32, 34], "39x": 30, "3b": 28, "3d": 34, "3e": [10, 34], "3rd": [3, 7, 21, 30, 34], "4": [2, 6, 11, 13, 14, 18, 20, 23, 28, 30, 31, 33], "40": [30, 31, 32, 34], "407": 34, "409": 26, "4096": [2, 33], "40b": 28, "40mb": 34, "41": [31, 32], "42": [31, 32], "425": 34, "43": [6, 11, 31, 32], "432": 34, "438": 34, "44": [30, 31, 32], "44x": 30, "45": [31, 32], "452": 34, "45x": 30, "46": [31, 32], "47": [31, 32], "470": 31, "471": 31, "473": 31, "476": 31, "479": 31, "47x": 30, "48": [30, 31, 32], "48x": 30, "49": [30, 31, 32], "49786": 34, "4bit": 34, "4k": 28, "4th": [28, 30], "4x": 3, "5": [2, 6, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 26, 28, 30, 31, 32, 33, 34], "50": [18, 31, 32], "50ghz": 33, "51": [31, 32], "512": [1, 6, 11, 16, 25, 28, 31], "513": 31, "52": [31, 32], "524": 34, "53": [31, 32], "531": 34, "54": [31, 32], "55": [31, 32, 33], "551": 34, "55x": 30, "56": [30, 31, 32, 33], "57": 31, "57x": 30, "58": [17, 31], "589": 34, "58x": 30, "59": 31, "591": 31, "5d": 16, "5m": 34, "5mb": 34, "5rc3": 34, "5x": 34, "6": [2, 5, 7, 11, 14, 20, 30, 31, 32, 33, 34], "60": 31, "602": 34, "61": 31, "62": 31, "62x": 30, "63": [31, 34], "64": [2, 8, 10, 16, 20, 30, 31, 34], "642": 34, "647": 34, "648": 34, "64byte": 34, "64gb": 30, "65": 31, "654": 31, "655": 31, "65536": 33, 
"657": 34, "66": [17, 31, 34], "67": [30, 31, 34], "674": 34, "67x": 30, "68": [31, 34], "684": 34, "685": 34, "69": [30, 31], "692": 34, "6b": [2, 28, 30], "7": [10, 14, 17, 20, 21, 31, 32, 34], "70": 31, "70b": [28, 34], "71": 31, "711": 34, "71x": 30, "72": 31, "73": 31, "74": 31, "75": [30, 31], "75x": 30, "76": [30, 31], "760": [31, 32], "761": [31, 32], "762": 32, "763": 32, "764": 31, "768gb": 30, "77": 31, "77x": 30, "78": [30, 31], "784": 31, "787": 34, "78x": 30, "79": [30, 31], "7b": [6, 28, 30, 34], "7f": 16, "7m": 34, "7x": 34, "8": [14, 16, 30, 31, 32, 33], "80": [5, 30, 31], "81": [30, 31], "8180": 32, "8180m": [14, 33], "81x": 30, "82": 31, "822": 34, "83": [31, 33], "8375c": 32, "8380": 30, "8380h": 30, "83x": 30, "84": [6, 30, 31, 33], "85": [30, 31], "85x": 30, "86": [30, 31], "87": 31, "88": 31, "8b": 28, "8x": 18, "8x7b": 28, "9": [6, 7, 14, 17, 23, 25, 31, 32], "9000": 32, "9000000000": [31, 33], "9001": 32, "9002": 32, "9003": 32, "90ghz": 30, "92": 30, "93": 30, "96": 30, "96x": 30, "97": 30, "975": 32, "98": 30, "981": 32, "982": 32, "99": [16, 30, 34], "992": 34, "A": [2, 5, 6, 7, 10, 11, 17, 26, 28, 31, 33, 34], "And": [15, 20, 32, 34], "As": [10, 19, 20, 28, 31, 32, 33, 34], "At": [7, 17], "But": [17, 18], "By": [17, 31, 33], "For": [1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21, 23, 24, 25, 26, 28, 29, 31, 32, 33, 34], "If": [2, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16, 17, 20, 26, 31, 32, 33, 34], "In": [1, 2, 6, 7, 8, 12, 16, 17, 18, 19, 21, 23, 28, 31, 32, 33, 34], "It": [2, 6, 7, 8, 10, 13, 17, 18, 20, 21, 23, 26, 29, 31, 33, 34], "Its": 28, "NOT": [18, 31], "No": [2, 18, 34], "Not": 2, "ON": 30, "On": [1, 2, 7, 18, 28, 33], "One": [3, 18, 19, 31, 33], "Such": 17, "The": [0, 1, 2, 5, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "Then": 32, "There": [14, 16, 20, 33, 34], "These": [1, 5, 6, 7, 8, 13, 28], "To": [2, 5, 6, 7, 10, 13, 15, 16, 17, 18, 20, 21, 23, 28, 32, 33, 34], "Will": [6, 18], "With": [1, 2, 7, 10, 20, 31, 34], "_": [13, 15, 16, 17, 18, 20, 30, 31, 32, 33, 34], "___": 13, "_____": 13, "__init__": [5, 6, 8, 10, 16, 20, 26, 34], "__m256i": 17, "__m512": 17, "__m512i": 17, "__main__": [26, 31, 32, 34], "__name__": [26, 34], "_appli": 18, "_build": 5, "_c": [17, 26], "_cmp_ord_q": 17, "_core": 31, "_cvt_fp32_to_bf16": 17, "_get_current_isa_level": 17, "_get_highest_binary_support_isa_level": 17, "_get_highest_cpu_support_isa_level": 17, "_jit_set_texpr_fuser_en": 26, "_lu_with_info": 8, "_mm256_mask_storeu_epi16": 17, "_mm256_storeu_si256": 17, "_mm512_add_epi32": 17, "_mm512_and_si512": 17, "_mm512_castps_si512": 17, "_mm512_cmp_ps_mask": 17, "_mm512_cvtneps_pbh": 17, "_mm512_cvtusepi32_epi16": 17, "_mm512_loadu_p": 17, "_mm512_mask_blend_epi32": 17, "_mm512_maskz_loadu_p": 17, "_mm512_set1_epi32": 17, "_mm512_srli_epi32": 17, "_native_multi_head_attent": 8, "_reorder_cach": 2, "_timestamp_inst": 31, "_timestamp_instance_": 31, "ab": [13, 32], "abi": [6, 17, 34], "abil": 16, "abl": 15, "abnorm": [26, 34], "about": [1, 2, 5, 7, 13, 16, 32, 33, 34], "abov": [2, 5, 10, 19, 28, 30, 31, 32], "absolut": [2, 31], "abstract": [2, 11, 20], "acceler": [1, 2, 3, 6, 7, 13, 28, 29, 30, 34], "accept": [2, 34], "access": [2, 6, 7, 18, 19, 32, 34], "accommod": 18, "accompani": 34, "accord": [2, 13, 28, 33, 34], "accordingli": 16, "account": 6, "accu": 16, "accur": 8, "accuraci": [2, 3, 6, 7, 8, 15, 16, 21, 22, 26, 28, 34], "accuracy_criterion": [2, 4, 16, 34], "accuracy_criterion_typ": 2, 
"accuracy_criterion_valu": 2, "achang": 15, "achiev": [1, 2, 6, 7, 28, 33, 34], "across": 16, "act": 34, "act_ic_observ": 2, "act_observ": 2, "act_quant_mod": 2, "action": [6, 23], "activ": [2, 6, 7, 15, 16, 20, 28, 31, 33], "actual": [18, 21], "acycl": 13, "ad": [2, 7, 10, 33, 34], "adagrad": [19, 21], "adagrad_fused_step": 19, "adagrad_step": 19, "adam": 34, "adapt": 7, "adaptive_avg_pool3d": 8, "adaptive_max_pool3d": 8, "adaptiveaveragepoolingkrnl": 17, "add": [2, 5, 7, 8, 13, 14, 19, 21, 32, 34], "add_": 19, "add_argu": [6, 23], "add_casual_mask": 2, "add_execut": 6, "add_help": [6, 23], "addbmm": 8, "addcdiv_": 19, "addcmul_": 19, "addit": [2, 6, 7, 17, 21, 28, 34], "addition": 32, "addlayernorm": 34, "addmm": 8, "addmm_": 8, "addr": 31, "address": [7, 18, 31, 32, 33, 34], "addtion": 17, "adjust": 16, "adopt": [28, 34], "advanc": [1, 2, 6, 7, 16, 25, 28], "advantag": [1, 2, 7, 9, 12, 18, 21, 25, 30, 31, 33], "aes_ni": 17, "affect": [2, 31], "affin": [7, 10, 15, 20, 31, 32, 33], "affinit": 32, "after": [2, 5, 7, 13, 20, 21, 23, 24, 32, 33, 34], "afterward": [31, 33], "ag": 7, "again": [5, 19, 32], "against": 6, "agre": 5, "ahead": 5, "ai": [1, 2, 3, 6, 7, 28], "aim": [7, 10, 16, 33], "aka": [7, 18], "albert": 34, "algorithm": [2, 13, 18, 30, 34], "alia": 2, "alibi": 2, "alibi_slop": 2, "align": [17, 18, 21, 34], "aliv": 32, "all": [2, 5, 6, 8, 13, 14, 17, 19, 20, 28, 29, 32, 33, 34], "all_logical_cor": 14, "all_physical_cor": 14, "allcat": 2, "allenai": 26, "alloc": [2, 10, 20, 28, 30, 32, 34], "allow": [2, 8, 14, 16, 22, 33, 34], "allreduc": 2, "almost": 18, "along": [2, 5, 6, 21, 33, 34], "alpha": [2, 6, 19, 22], "alpha_max": [16, 22], "alpha_min": [16, 22], "alpha_step": [16, 22], "alphafold2": 34, "alreadi": [1, 5, 18, 28, 33], "also": [1, 2, 6, 7, 10, 13, 14, 16, 18, 19, 28, 30, 31, 33, 34], "altern": [2, 6, 18], "although": [2, 33], "alwai": [5, 6, 7, 8, 18, 31, 33, 34], "amazon": 32, "among": [2, 31, 32, 33], "amount": [2, 16, 26, 28, 33], "amp": [4, 6, 10, 23, 26, 34], "amp_dtyp": [6, 23], "amp_en": [6, 23], "ampconf": 34, "amplifi": 1, "amx": [1, 3, 6, 7, 17, 25, 28, 30], "amx_bf16": 17, "amx_int8": 17, "amx_til": 17, "an": [1, 2, 5, 6, 7, 8, 10, 11, 13, 14, 16, 17, 18, 19, 20, 21, 26, 31, 32, 33, 34], "anaconda": 17, "analysi": 33, "ani": [2, 5, 8, 10, 17, 18, 32, 34], "announc": 34, "anonym": 17, "anoth": [14, 31, 33, 34], "answer": [18, 30], "anymor": [7, 34], "anyplac": 4, "ao": [2, 6, 15], "apach": [27, 32], "api": [1, 3, 6, 10, 11, 15, 20, 26, 33, 34], "app": [6, 34], "append": [6, 7], "append_torchlib_if_found": 6, "appli": [2, 6, 7, 8, 12, 13, 16, 18, 19, 21, 23, 26, 28, 29, 31, 34], "applic": [1, 2, 7, 20, 28, 32, 33], "apply_funct": 2, "appropri": 33, "apr": 3, "ar": [1, 2, 3, 5, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28, 29, 30, 31, 32, 33, 34], "arang": [2, 6, 16], "arbitrari": 2, "arc": 3, "architectur": [2, 28, 30, 33], "area": [7, 14], "aren": 5, "arg": [2, 4, 6, 7, 14, 16, 19, 23, 31, 32, 34], "argc": 6, "argmax": 16, "argpars": [6, 23], "argument": [2, 6, 7, 22, 26, 31], "argumentpars": [6, 23], "argv": 6, "around": 31, "arrai": 18, "articl": [30, 33], "arxiv": 2, "ask": 5, "assign": [18, 31, 32, 33], "assum": [2, 7, 8, 23, 32, 33, 34], "asu": 33, "async": [20, 34], "asynchron": [2, 7], "aten": [2, 6, 7, 34], "aten_cpu_cap": 17, "attach": 33, "attent": [1, 2, 7, 28, 34], "attention_mask": [2, 6], "attention_mask_pad": 6, "attn_output": 2, "attn_weight": 2, "attribut": 18, "aug": [3, 30], "auto": [2, 6, 10, 17, 18, 22, 23, 26, 28, 31, 33, 
34], "auto_alpha_arg": 16, "auto_ipex": 34, "auto_kernel_select": [2, 7, 30], "autocast": [4, 6, 7, 10, 23, 34], "autoclass": 5, "autoconfig": [6, 23], "autofunct": 5, "autom": [4, 7, 8, 14, 31, 32, 34], "automat": [1, 2, 6, 7, 9, 10, 12, 13, 15, 16, 18, 22, 28, 31, 32, 33, 34], "automaticlli": 2, "automixprecis": 34, "automodelforcausallm": [6, 23, 29, 34], "autotoken": [6, 23], "autotp": 28, "autotun": [2, 4, 22, 34], "avaiabl": 2, "avail": [1, 2, 6, 7, 11, 17, 20, 22, 23, 29, 31, 33, 34], "avg_pool3d": 8, "avoid": [2, 10, 20, 21, 26, 31, 32, 33, 34], "avx": [1, 6, 17, 25, 28], "avx2": [17, 26, 34], "avx256": 17, "avx2_vnni": 17, "avx512": [7, 17, 18, 32, 34], "avx512_4fmap": 17, "avx512_4vnniw": 17, "avx512_bf16": 17, "avx512_bitalg": 17, "avx512_bw": 17, "avx512_cd": 17, "avx512_core_vnni": 34, "avx512_dq": 17, "avx512_er": 17, "avx512_f": 17, "avx512_fp16": 17, "avx512_ifma": 17, "avx512_pf": 17, "avx512_vbmi": 17, "avx512_vbmi2": 17, "avx512_vl": 17, "avx512_vnni": 17, "avx512_vp2intersect": 17, "avx512_vpclmul": 17, "avx512_vpopcntdq": 17, "avx_vnni": 17, "awar": [18, 20, 31, 32], "awq": 34, "b": [7, 8, 16, 28], "back": [2, 6, 12, 17, 18, 21, 26], "backbon": 2, "backend": [1, 2, 3, 6, 7, 12, 13, 16, 17, 23, 26, 28, 31, 33, 34], "background": 33, "background_thread": [31, 33], "backpropag": 16, "backward": [6, 7, 8, 16, 21, 33, 34], "bactchnorm": 34, "baddbmm": 8, "bag": [26, 34], "baichuan": [2, 28, 34], "baichuan2": [28, 34], "bake": 34, "balanc": [7, 16, 22, 33], "bandwidth": 28, "base": [1, 2, 3, 4, 5, 6, 7, 10, 11, 17, 20, 21, 26, 28, 29, 30, 32, 33, 34], "base_dir": 29, "base_text_classif": 30, "baselin": [16, 22, 34], "basic": [2, 4, 16, 21, 33, 34], "batch": [2, 6, 7, 13, 16, 18, 20, 23, 26, 30, 32, 34], "batch_decod": [6, 23], "batch_id": 6, "batch_idx": [6, 13], "batch_siz": [2, 6, 11, 13, 16, 18, 23, 32], "batchnorm": [13, 17, 18, 26, 34], "batchnorm2d": [7, 10, 26, 34], "batchsiz": [2, 20], "beam": [2, 28], "beam_idx": 2, "beam_idx_tmp": 6, "beam_width": 28, "becam": 34, "becaus": [8, 17, 18, 21, 28, 33, 34], "becom": [7, 28, 33], "been": [0, 1, 6, 7, 10, 17, 18, 28, 31, 33, 34], "beeter": 28, "befor": [1, 2, 5, 6, 13, 14, 17, 18, 20, 31, 33, 34], "begin": 5, "beginn": 16, "behavior": [2, 20, 31, 33], "behaviour": 10, "being": [7, 33], "believ": [8, 18], "below": [6, 8, 10, 14, 19, 20, 21, 22, 23, 26, 28, 31, 32, 33, 34], "bench": 32, "benchmark": [6, 26, 30, 31, 34], "benefici": 18, "benefit": [6, 7, 8, 10, 20, 21, 28, 32, 33, 34], "benifit": 2, "bert": [3, 4, 10, 30, 34], "bert_int8_jit": 32, "bert_ipex_int8": 32, "bertmodel": [4, 6, 11, 32], "bertmodelmodel": 4, "besid": [28, 33, 34], "best": [2, 6, 7, 8, 14, 16, 17, 22, 24, 28, 33, 34], "beta": [23, 26], "better": [1, 2, 6, 7, 15, 18, 20, 28, 31, 32, 33, 34], "between": [7, 8, 17, 20, 33, 34], "beyond": 7, "bf16": [2, 3, 7, 17, 19, 21, 23, 26, 28, 30, 34], "bf16_gw": 21, "bf16_w": 21, "bfloat16": [2, 3, 4, 7, 10, 11, 17, 18, 23, 29, 31, 34], "bfp16": 34, "bia": [2, 8, 20, 34], "bias_kei": 2, "big": [7, 18], "bigcod": 28, "bigscienc": 28, "bin": [5, 6, 17, 31, 32], "binari": [5, 6, 7, 8, 17, 34], "binary_cross_entropi": 8, "binary_cross_entropy_with_logit": 8, "bind": [6, 7, 31, 32, 33, 34], "bio": 30, "bit": [21, 28], "blob": 2, "block": [2, 5, 16, 20, 22, 28, 33, 34], "block_numb": 2, "block_siz": 2, "block_tabl": 2, "blocktim": 31, "blockwis": 16, "blog": [2, 34], "bloom": [2, 28], "bmm": [8, 34], "bmp": 18, "bn": [2, 10, 15, 26, 34], "bn_fold": 2, "bodi": 17, "bool": [2, 14], "boolean": [7, 34], "booltensor": 7, 
"boost": [3, 6, 7, 9, 21, 30, 31, 33, 34], "both": [1, 2, 6, 7, 16, 18, 19, 21, 28, 29, 31, 32, 33, 34], "bother": 16, "bottl": 19, "bottleneck": [2, 28], "bottom": 21, "bound": [19, 20, 28, 33], "box": [6, 10, 33], "branch": [1, 7, 30], "break": [6, 16, 34], "brew": 5, "brief": [18, 28, 34], "briefli": 33, "bring": [2, 6, 7, 9, 15, 16, 21, 28, 31, 33, 34], "broad": [7, 9, 34], "broader": 34, "brought": [33, 34], "buffer": [2, 28], "bug": [1, 5, 34], "bui": 21, "build": [6, 28, 33, 34], "built": [7, 17, 20, 34], "busi": 33, "c": [1, 7, 8, 16, 17, 20, 26, 28, 31, 32, 33, 34], "c1": 20, "c10": [6, 17], "c620": 33, "cach": [2, 5, 7, 19, 20, 30, 34], "cache_weight_for_large_batch": 2, "caff": 3, "calcul": [1, 2, 8, 16, 21, 22], "cali_dataset": 34, "calib_dataload": [2, 6, 16, 34], "calib_dataset": [6, 29], "calib_evalu": 6, "calib_func": 2, "calib_sampl": 29, "calibr": [2, 13, 22, 26, 29, 30, 32, 34], "calibrated_model": 34, "calibration_data_load": [4, 6, 13], "calibration_data_set": [15, 34], "calibration_model": 29, "calibration_sampl": 6, "call": [2, 6, 8, 13, 17, 18, 21, 26, 32, 33, 34], "caller": [26, 34], "can": [1, 2, 5, 6, 7, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 26, 28, 29, 30, 31, 32, 33, 34], "cannot": [8, 19, 26, 31, 34], "canon": 18, "capabl": [3, 17, 34], "capac": [21, 30], "captur": [4, 34], "card": 18, "care": 32, "carri": 30, "case": [2, 6, 7, 9, 12, 16, 17, 18, 28, 31, 33, 34], "cases": 32, "cast": [2, 8, 21, 28], "casual": 26, "cat": [8, 31, 32, 34], "catch": 6, "categor": 7, "categori": [8, 34], "caus": [2, 7, 21, 26, 28, 31, 33, 34], "causal": 2, "cc": [5, 6, 17], "ccl": [6, 31, 34], "cd": [5, 6], "cdist": 8, "center": 34, "cento": 30, "cerr": 6, "certain": [1, 7, 26, 28, 29, 31, 33], "ch_axi": 2, "chain": 21, "chang": [2, 5, 6, 7, 8, 10, 11, 12, 15, 17, 18, 20, 23, 25, 26, 29, 31], "changed_onli": 5, "changelog": 34, "channel": [2, 3, 10, 15, 16, 26, 34], "channels_last": [6, 7, 18, 23, 33, 34], "char": 6, "charact": 5, "chat": 28, "chatglm": [2, 28], "chatglm2": [28, 34], "chatglm3": [28, 34], "cheat": 23, "check": [2, 5, 6, 7, 13, 18, 28, 29, 31, 34], "check_trac": [6, 13, 32], "checkpoint": [2, 6, 29], "checkpoints_json": 29, "chip": 33, "chipset": 33, "choic": [6, 21, 23, 31], "choleski": 8, "cholesky_invers": 8, "cholesky_solv": 8, "choos": [6, 8, 20, 23, 31, 33, 34], "chosen": [8, 14, 17], "chw": 18, "chwn": 18, "ci": 5, "cifar10": [6, 13], "circumst": 8, "clamp": 13, "clang": 5, "class": [2, 5, 6, 7, 8, 10, 16, 20, 26, 34], "classif": [26, 30], "claus": [7, 10, 19], "clean": 5, "clear": 10, "clibrat": 34, "click": 3, "clone": 5, "close": [18, 31, 33], "cloud": 3, "clr": 19, "cmake": [5, 6, 17, 34], "cmake_minimum_requir": 6, "cmakefil": 17, "cmakelint": 5, "cmakelist": 6, "cnn": [7, 18, 26, 30, 33, 34], "co": [2, 34], "coco": 30, "code": [1, 2, 5, 6, 7, 10, 11, 12, 13, 18, 19, 21, 23, 24, 26, 27, 29, 33, 34], "codegen": [2, 28, 34], "codeless": 31, "codellama": 28, "codenam": 34, "collabor": 3, "collate_batch": 6, "collate_fn": 6, "collect": [6, 32, 33, 34], "column": 6, "com": [2, 5, 34], "combin": [2, 12, 14, 28, 31, 34], "come": 33, "comma": 33, "command": [4, 5, 6, 14, 23, 31, 32, 33, 34], "comment": [5, 14, 17, 22, 34], "commit": 5, "common": [17, 21, 28, 31, 33], "commonli": [7, 28, 33, 34], "commun": [6, 28, 31, 32, 33, 34], "communication_backend_nam": 29, "compact": [31, 32, 33], "compar": [1, 2, 7, 13, 18, 21, 26, 28, 30, 31, 33, 34], "compat": [17, 21], "compet": 33, "competit": 33, "compil": [1, 5, 6, 23, 26, 33, 34], "complet": [5, 6, 14, 
18, 22, 29, 33], "complex": 17, "complexdoubl": 17, "complexfloat": 17, "complic": [26, 31, 33], "complier": 17, "compon": [15, 26, 27, 28], "compos": [6, 13], "comprehens": [1, 34], "compress": 2, "compressor": [3, 7, 16, 22, 34], "compris": 18, "compuat": 13, "comput": [2, 6, 7, 13, 15, 16, 18, 20, 21, 28, 30, 31, 32, 33, 34], "concat": [2, 20, 26, 28, 34], "concat_fp32_from_bf16": 21, "concat_linear": 2, "concat_output": 2, "concaten": [2, 21], "concept": [18, 33], "concern": 7, "conclud": [30, 34], "conclus": 18, "concurr": [32, 33], "conda": [5, 33], "conda_prefix": [31, 32], "condit": 27, "conduct": 7, "conf": [4, 13, 14, 31, 34], "conf_fil": [14, 34], "confer": 3, "config": [2, 6, 11, 23, 31, 32], "configur": [2, 4, 6, 7, 14, 15, 16, 17, 31, 32, 34], "confirm": 31, "conflict": [7, 17], "connect": 33, "consecut": 33, "consider": 16, "consist": [16, 28, 33, 34], "const": [6, 17], "constant": 13, "constraint": [2, 34], "construct": [2, 7, 13], "consum": [7, 14], "consumpt": 34, "contain": [2, 5, 6, 13, 17, 26, 31, 32, 33, 34], "containeraliasingtest": 5, "content": [29, 34], "context": [2, 5, 6, 8, 20, 28, 33, 34], "context_len": 2, "contigu": [6, 13, 18, 33, 34], "contiguous_format": [18, 33], "continu": [31, 32, 34], "contribut": [28, 31, 34], "control": [1, 2, 7, 20, 26, 31, 33, 34], "conv": [2, 8, 10, 13, 15, 20, 26, 34], "conv1d": [8, 13], "conv2": 20, "conv2d": [2, 7, 8, 10, 13, 18, 20, 26, 34], "conv3d": [8, 13, 34], "conv_bn": 2, "conv_bn_fold": [2, 26, 34], "conv_tbc": 8, "conv_transpose1d": 8, "conv_transpose2d": 8, "conv_transpose3d": 8, "conveni": [8, 34], "convers": [2, 8, 13, 34], "convert": [1, 2, 4, 6, 7, 8, 9, 10, 13, 16, 17, 18, 20, 23, 26, 32, 34], "convert_model": [4, 13, 15, 16], "converted_model": [4, 6, 26, 34], "convolut": [2, 6, 7, 13, 20, 33, 34], "convolution1d": 34, "convolutuon": 2, "convrelu": 13, "convsumrelu": 13, "convtranspose2d": [2, 13], "convtranspose3d": 13, "coo": 18, "cooper": [7, 30, 34], "copi": [5, 17, 18], "copyright": [17, 27], "core": [2, 7, 14, 17, 30, 33, 34], "core_id": [2, 20, 31], "correct": [7, 18, 25, 34], "correspond": [20, 31, 34], "cosine_embedding_loss": 8, "cost": [2, 6, 28, 30, 33], "costli": 33, "could": [7, 13, 16, 18, 26, 32, 33, 34], "count": 31, "counterpart": [2, 7, 18, 34], "coupl": [20, 33, 34], "cout": 6, "cover": [13, 18, 31], "cpp": [5, 6, 33], "cppsdk": 34, "cpu": [1, 3, 4, 5, 6, 7, 8, 10, 14, 15, 16, 19, 20, 23, 25, 26, 28, 30, 31, 32, 34], "cpu_capability_avx512": 17, "cpu_capability_avx512_bf16": 17, "cpu_featur": 17, "cpu_feature_main": 17, "cpu_launcher_arg": 32, "cpu_launcher_en": 32, "cpu_pool": [2, 20, 34], "cpu_pool1": 20, "cpu_pool2": 20, "cpuid": 17, "cpuinfo": 17, "cpunodebind": 33, "cpupool": [2, 20, 34], "crash": [31, 33, 34], "creat": [7, 16, 20, 33, 34], "creation": 2, "creator": 34, "credit": 17, "criteria": 16, "criterion": [6, 8, 16, 22], "cross": [32, 33, 34], "cross_entropy_loss": 8, "crossentropyloss": [6, 16], "csrc": 26, "csv": 14, "ctc_loss": 8, "cu": 5, "cudnn": 18, "current": [1, 2, 5, 7, 11, 13, 14, 15, 16, 17, 19, 20, 26, 28, 29, 34], "current_posit": 2, "custom": [1, 2, 7, 26, 34], "customized_forward": 10, "cv": 34, "cvt_fp32_to_bf16": 17, "cvt_fp32_to_bf16_kernel_fn": 17, "cvt_fp32_to_bf16_kernel_impl": 17, "cvt_fp32_to_bf16_kernel_stub": 17, "cvtfp32tobf16": 17, "cvtfp32tobf16krnl": 17, "cxx": [6, 17], "cxx11": 34, "cxx_standard": 6, "d": [4, 5, 6, 7, 8, 13, 26, 28, 34], "d8": 33, "d__avx512f__": 17, "d__avx__": 17, "dag": 13, "daili": 34, "data": [2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 
16, 17, 18, 19, 20, 21, 23, 26, 31, 32, 34], "data_typ": 18, "databrick": 28, "dataload": [2, 6, 10, 13, 16, 20, 22, 29, 34], "dataset": [6, 13, 16, 29, 30, 33, 34], "dataset_nam": [10, 34], "datatyp": [20, 34], "date": 34, "dcmake_prefix_path": 6, "dcpmm": 30, "dcpu_cap": 17, "dcpu_capability_amx": 17, "dcpu_capability_avx2": 17, "dcpu_capability_avx512": 17, "dcpu_capability_avx512_bf16": 17, "dcpu_capability_avx512_fp16": 17, "dcpu_capability_avx512_vnni": 17, "dcpu_capability_default": 17, "ddp": [2, 6], "ddr": 30, "ddr4": 33, "dealloc": 33, "debug": [2, 31], "debug_squad": [10, 34], "dec": 3, "decai": 7, "decid": [2, 15, 20, 28], "decim": 21, "declar": 17, "decltyp": 17, "decod": [2, 28, 30, 34], "deconv3d": 34, "decor": 2, "dedic": [2, 6, 28, 34], "deduct": 31, "deep": [3, 7, 8, 11, 13, 14, 21, 33], "deepcopi": 2, "deepspe": [2, 34], "def": [2, 6, 8, 10, 16, 20, 26, 34], "default": [2, 4, 6, 7, 10, 12, 13, 15, 16, 17, 20, 22, 23, 26, 28, 30, 32, 33, 34], "default_dynamic_qconfig": [15, 32], "default_dynamic_qconfig_map": 6, "default_dynamic_qconfigprepared_model": 4, "default_static_qconfig": [13, 15, 32, 34], "default_static_qconfig_map": 6, "default_static_qconfigprepared_model": 4, "defin": [2, 5, 6, 7, 8, 10, 16, 17, 18, 22, 32], "definit": [17, 21, 34], "deinit": 5, "deliv": [7, 28, 34], "demand": [2, 7], "demonstr": [6, 18, 26, 32], "demostr": 23, "denomin": 2, "denot": 21, "dens": [7, 18], "dep": 34, "depend": [5, 7, 17, 18, 25, 26, 33, 34], "deploi": 34, "deploy": [2, 7, 13, 34], "deployment_mod": [2, 6, 23], "deprec": [3, 26], "dequant": [13, 16], "desc": 18, "describ": [8, 13, 18, 21, 32, 33], "descript": [4, 7, 16, 18, 20, 25, 33, 34], "descriptor": 34, "design": [2, 5, 8, 18, 21, 29, 34], "desir": [16, 31], "destroy_process_group": 6, "destruct": 33, "detail": [2, 5, 6, 7, 8, 9, 11, 13, 17, 18, 24, 25, 26, 28, 30, 32, 33, 34], "detect": [1, 6, 12, 17, 26, 33, 34], "detectron2": 18, "determin": [2, 6, 17, 21, 33], "develop": [1, 3, 6, 28, 30, 33, 34], "devic": [1, 2, 15, 29, 31, 34], "device_nam": [7, 8], "diagram": [18, 33], "dict": [2, 6, 23], "dictionari": 34, "did": [33, 34], "didn": 20, "differ": [1, 2, 7, 15, 16, 17, 18, 20, 28, 31, 32, 33, 34], "difficult": 18, "difficulti": 16, "diffus": [3, 34], "digit": 21, "dim": [2, 6, 18, 23], "dimens": [2, 18, 26], "dinner": [6, 23], "dir": [17, 31], "direct": [2, 5, 13], "directli": [2, 6, 33, 34], "directori": [1, 5, 6, 14, 29, 31, 32], "dirty_decay_m": [31, 33], "disabl": [2, 6, 7, 13, 26, 31, 33, 34], "disable_auto_channels_last": 9, "disable_iomp": [14, 32], "disable_numactl": [14, 32], "disadvantag": 21, "discret": 1, "discrete gpu": 1, "discuss": [5, 18, 33], "dispatch": [1, 34], "dist": 6, "dist_sampl": 6, "distilbert": 30, "distribut": [2, 3, 7, 16, 31, 32, 33, 34], "distributeddataparallel": [6, 34], "distributedsampl": 6, "div": 13, "divid": [2, 13, 31, 32, 33, 34], "divis": [2, 20], "divisor": [2, 20], "dl": [3, 7, 34], "dlopen": 20, "dlrm": [3, 7, 26, 30, 34], "dnnl": 30, "dnnl_verbos": 2, "do": [2, 5, 8, 16, 18, 20, 21, 26, 28, 30, 31, 32, 33, 34], "do_ev": [10, 34], "do_sampl": [6, 23], "doc": [1, 2, 5, 11, 29, 34], "doc_strid": [10, 34], "docker": [30, 34], "dockerfil": 34, "dockerhub": 34, "docstr": 5, "document": [0, 7, 17, 20, 29, 34], "doe": [2, 7, 13, 18, 20, 26, 34], "doesn": [2, 15, 16, 18, 26, 34], "dolli": [28, 34], "domin": [1, 7, 28], "don": [2, 5, 8, 14, 17, 34], "done": [6, 10, 16, 17, 26, 33, 34], "dot": [2, 7, 18, 28], "doubl": 17, "down": [5, 32, 34], "download": [6, 13, 16], "downstream": 8, 
"dpc": 1, "dpcpp": 34, "dram": 2, "dramat": [32, 33], "drawback": [2, 21], "drive": [1, 7, 28], "driven": 2, "drop": [31, 32], "dropout": [2, 10], "dst": 17, "dtype": [2, 4, 6, 7, 8, 10, 11, 13, 15, 16, 17, 23, 26, 29, 31, 34], "due": [1, 8, 10, 17, 20, 26], "dummi": 32, "dummy_tensor": 32, "dummymodul": 10, "dump": [2, 31], "durat": [2, 21], "dure": [4, 6, 7, 10, 13, 16, 21, 31, 33, 34], "dynam": [1, 4, 20, 28, 32, 33, 34], "dynamic_qconfig": 15, "dynamic_quantized_model": 6, "e": [1, 2, 6, 7, 8, 12, 16, 17, 18, 28, 31, 33, 34], "each": [2, 8, 14, 16, 17, 19, 20, 21, 31, 32, 33, 34], "eager": [1, 7, 12, 23, 32, 34], "earli": [2, 34], "earlier": 21, "eas": [7, 18, 34], "easi": [1, 3, 21], "easier": [2, 18, 21], "easili": [10, 15], "ec2": 32, "edit": [5, 26, 34], "effect": [2, 17, 21, 26, 32, 33], "effici": [1, 7, 11, 19, 20, 28, 31, 33, 34], "effort": 34, "eig": 8, "einsum": 34, "either": [2, 26, 31], "el8_4": 30, "elaps": 33, "element": [2, 18, 19], "eleutherai": [2, 28], "elif": 6, "elimin": 28, "els": [6, 14, 17, 18, 23], "elser": 34, "eltwis": 34, "elu": 13, "emb": 7, "emb1": 7, "emb2": 7, "emb3": 7, "emb_m": 7, "embed": [2, 7, 28, 34], "embedding_bag": 10, "embedding_spec": 7, "embeddingbad": 34, "embeddingbag": [7, 26, 34], "embeddingspec": 7, "embedingbag": 7, "emblist": 7, "emerg": [1, 7, 28], "emphas": 33, "emply_lik": 2, "empow": 3, "empti": [18, 31], "enabl": [1, 2, 3, 4, 6, 7, 8, 10, 13, 16, 18, 20, 22, 23, 26, 28, 31, 32, 33, 34], "enable_auto_channels_last": 9, "enable_auto_mix_precis": 34, "enable_auto_mixed_precis": 34, "enable_auto_optim": 34, "enable_blockwise_loss": [16, 22], "enable_jemalloc": 32, "enable_onednn_fus": [2, 13], "enable_tcmalloc": 32, "encod": 34, "encount": [26, 34], "encourag": 34, "end": [6, 13, 20, 34], "endif": 17, "endl": 6, "engin": [1, 18, 33], "enhanc": [1, 3, 28, 34], "enough": [2, 7, 19], "ensur": [11, 19, 20, 32], "entir": [2, 16, 28], "enumer": [6, 13, 16, 29], "env": [6, 29], "env_key1": 5, "env_key2": 5, "env_val1": 5, "env_val2": 5, "environ": [2, 5, 6, 17, 20, 24, 28, 30, 31, 32, 33], "ep": [2, 7, 10, 19], "epoch": 16, "equal": [2, 15, 20, 31, 32, 33], "equip": 33, "equival": 34, "error": [2, 5, 6, 7, 10, 16, 18, 21, 22, 26, 34], "especi": [2, 5, 28, 34], "etc": [2, 5, 6, 17, 34], "eval": [2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "eval_func": [2, 16, 34], "eval_funct": 4, "evalu": [2, 16, 34], "even": [2, 5, 7, 33, 34], "evenli": 31, "everi": [2, 28], "exact": 2, "exactli": 21, "exampl": [2, 5, 7, 8, 13, 18, 19, 21, 22, 23, 24, 25, 28, 29, 32, 33, 34], "example_input": [2, 4, 6, 13, 15, 29, 32, 34], "example_kwarg_input": 2, "examplenet": 20, "examplenet1": 20, "examplenet2": 20, "exce": [26, 30, 33, 34], "except": [28, 31], "excess": 34, "excit": 34, "exclus": 31, "execut": [2, 4, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 19, 20, 26, 31, 32, 33, 34], "exetens": 2, "exhibit": 30, "exist": [1, 5, 7, 13, 26, 31, 33], "exit": [6, 31], "exp": 13, "expect": [2, 7, 30, 34], "expecttest": 5, "expens": 18, "experi": [5, 7, 10, 12, 16, 18, 26, 33, 34], "experiment": 34, "explain": [17, 18, 21], "explicit": [18, 20, 33], "explicitli": [2, 8, 16, 20, 26, 31, 34], "explor": 2, "expon": 21, "export": [4, 31, 33], "expos": 8, "express": [18, 34], "ext": [6, 34], "extend": [1, 18, 25, 33, 34], "extens": [2, 3, 4, 6, 9, 10, 13, 14, 16, 17, 23, 24, 25, 27, 28, 29, 30, 31, 33, 34], "extra": [2, 5, 10, 20, 31, 32], "extra_rope_config": 2, "extrem": [7, 14, 33], "f": [5, 6, 13, 16, 28, 34], "f1": 30, "f16c": 17, "f32": [17, 18], "f401": [6, 
11, 12, 13, 16, 23, 29], "face": 3, "facebook": [3, 6, 28], "facilit": 34, "fact": [18, 33], "factor": [2, 6, 16, 31], "fail": [10, 26, 34], "failur": [12, 34], "fake": 2, "fake_quantize_per_tensor_affin": 8, "falcon": [2, 28, 34], "fall": [2, 6, 12], "fals": [2, 4, 6, 7, 8, 13, 14, 15, 16, 17, 20, 22, 23, 26, 31, 32, 34], "famili": [2, 28, 33], "fashionmnist": 16, "fast": [4, 12, 33, 34], "fast_bert": [2, 4, 6, 7, 11, 34], "fast_layer_norm": [2, 34], "faster": [6, 7, 8, 30, 33], "fastest": 17, "fastlayernorm": [2, 34], "fatal_error": 6, "favorit": 31, "fb": 34, "feasibl": 10, "featur": [0, 1, 2, 3, 5, 8, 10, 13, 14, 18, 20, 23, 25, 26, 28, 30, 31, 32, 33, 34], "feb": 3, "feed": [2, 9, 18], "feedback": 34, "feedforward": 28, "feel": [5, 18, 34], "few": [5, 7, 9, 13, 16, 18, 32, 34], "fewer": 21, "fft_fft": 8, "fft_fft2": 8, "fft_fftn": 8, "fft_hfft": 8, "fft_ifft": 8, "fft_ifft2": 8, "fft_ifftn": 8, "fft_ihfft": 8, "fft_irfft": 8, "fft_irfft2": 8, "fft_irfftn": 8, "fft_rfft": 8, "fft_rfft2": 8, "fft_rfftn": 8, "figur": [1, 2, 21, 28, 33], "file": [2, 4, 5, 6, 8, 14, 15, 16, 17, 18, 31, 34], "filenam": 5, "find": [1, 2, 7, 14, 16, 23, 26, 30, 31, 34], "find_packag": 6, "findavx": 17, "fine": [3, 20, 29, 31, 32, 33, 34], "finer": [1, 7, 20], "finish": [6, 11, 12, 13, 16, 20], "first": [2, 3, 5, 6, 7, 9, 10, 12, 16, 19, 20, 21, 26, 31, 32, 33], "firstli": [2, 28], "fit": [5, 7, 33, 34], "fix": [2, 5, 7, 34], "flag": [2, 5, 7, 17, 20, 31, 34], "flake8": 5, "flan": 28, "flash": 34, "flatten": [16, 20], "flexibl": 34, "float": [2, 6, 7, 8, 14, 15, 16, 17, 21, 29, 34], "float16": [2, 8], "float32": [2, 13, 21, 23, 26, 30, 31, 34], "float64": 8, "flourish": 28, "flow": 26, "flush": [6, 23], "fma": 17, "fn_type": 17, "focu": [2, 10, 18, 29, 34], "focus": [13, 34], "fold": [2, 10, 15, 16, 26, 34], "folder": 5, "follow": [1, 2, 4, 5, 6, 7, 8, 11, 14, 15, 16, 17, 18, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34], "footbal": 7, "footprint": [7, 21, 28, 34], "forg": 33, "fork": [17, 33], "format": [2, 5, 6, 7, 9, 14, 22, 26, 28, 31, 33, 34], "format_tag": 18, "formerli": [30, 33, 34], "formula": 21, "forward": [2, 6, 8, 13, 16, 20, 21, 26, 32, 33, 34], "found": [1, 6, 7, 14, 16, 18, 29, 31, 32, 33, 34], "foundat": [18, 33], "fp16": [2, 6, 17, 29], "fp32": [2, 4, 16, 17, 19, 21, 23, 28, 34], "fp32_gw": 21, "fp32_w": 21, "fpn": 30, "fraction": 21, "fractional_max_pool2d": 8, "fractional_max_pool3d": 8, "fragment": 33, "framework": [5, 34], "free": [31, 34], "freez": [6, 8, 10, 13, 15, 16, 20, 23, 26, 32, 34], "freezed_model": [26, 34], "frequenc": [2, 30], "frequent": 7, "friendli": [7, 33], "from": [1, 2, 3, 4, 5, 8, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 23, 25, 28, 29, 31, 32, 33, 34], "from_embeddingbag_list": 7, "from_pretrain": [4, 6, 11, 23, 29, 32], "front": [13, 34], "frontend": [1, 2, 7, 20, 28, 34], "frozenbatchnorm": 34, "frozenbatchnorm2d": 7, "fsi": 34, "fulfil": 20, "full": [2, 5, 18, 32, 33, 34], "fulli": [5, 15, 17, 21, 31, 33, 34], "function": [2, 5, 6, 7, 8, 10, 11, 12, 14, 15, 17, 20, 21, 23, 26, 28, 29, 31, 33, 34], "further": [1, 2, 5, 6, 7, 18, 20, 28, 33, 34], "fuse": [2, 7, 13, 16, 19, 28, 34], "fuse_update_step": 2, "fusion": [1, 2, 7, 10, 21, 28, 34], "futur": [7, 28, 34], "futuretensor": 20, "fx": [3, 7, 10, 26, 34], "g": [2, 7, 8, 16, 17, 18, 28, 34], "gain": [1, 7, 26, 28, 34], "game": 7, "gave": 14, "gb": 20, "gcc": 17, "gcp": 3, "gelu": [2, 13, 34], "gemm": [7, 18, 26, 28, 34], "gen": [3, 30, 34], "gen_": 2, "gen_id": [6, 23], "gen_text": [6, 23], "genai": [1, 7, 
28], "gender": 7, "gener": [1, 5, 6, 7, 10, 12, 16, 17, 18, 21, 23, 28, 29, 30, 31, 32, 33, 34], "generate_kwarg": [6, 23], "genv": 31, "geomean": 34, "geqrf": 8, "get": [1, 2, 3, 4, 6, 7, 10, 11, 15, 17, 20, 21, 22, 26, 28, 29, 30, 31, 33, 34], "get_acceler": 29, "get_core_list_of_node_id": 2, "get_cpp_typesize_and_vecs": 17, "get_cpp_typesize_and_vecsize_kernel_fn": 17, "get_cpp_typesize_and_vecsize_kernel_impl": 17, "get_cpp_typesize_and_vecsize_kernel_stub": 17, "get_smooth_quant_qconfig_map": [2, 6, 29], "get_weight_only_quant_qconfig_map": [2, 6, 29], "getattr": [6, 23], "getveclength": 17, "getveclengthkrnl": 17, "gif": 31, "gil": 20, "git": [2, 5, 28], "github": [1, 2, 5, 6, 7, 8, 34], "give": [32, 34], "given": [2, 6, 13, 14, 16, 28], "global": [2, 20, 22, 34], "global_past_key_valu": 6, "gnu": [6, 17, 32], "go": [2, 5, 8], "gomp_cpu_affin": 33, "good": [1, 2, 5, 7, 12, 18, 19, 28, 33, 34], "googl": [3, 5, 28], "gperftool": 33, "gpertool": 33, "gpt": [2, 28, 30], "gpt2": 26, "gptbigcod": [2, 28], "gptj": 2, "gptjforcausallm": 2, "gptq": [2, 6, 34], "gpu": [1, 3, 18, 34], "grad": [7, 19], "grad0": 19, "grad1": 19, "grad_i": 19, "grad_n": 19, "gradient": 7, "grain": [1, 3, 7, 20], "granular": [2, 31, 32, 33], "graph": [1, 4, 8, 10, 16, 23, 26, 31, 34], "graph_for": 13, "graph_mod": [2, 4, 7, 12, 34], "graphic": 33, "great": 33, "greater": 2, "greedi": [6, 23], "grid": 14, "grid_sampl": 8, "grokk": 3, "ground": 21, "group": [2, 19, 20, 33], "group_norm": 8, "group_siz": 2, "gru": 15, "grucel": 15, "gt": [4, 14, 28, 33], "gtest_filt": 5, "guid": [3, 6, 7, 17, 32, 34], "guidanc": 7, "guidelin": 18, "gw": 21, "h": [5, 6, 7, 16, 18, 26, 31, 32], "ha": [0, 1, 2, 7, 10, 14, 17, 18, 20, 21, 26, 28, 30, 31, 33, 34], "had": [6, 33], "half": [2, 7, 17, 21], "halv": 21, "handl": [6, 18, 33], "handler": 32, "hang": [33, 34], "happen": 7, "hard": [18, 26], "hardsigmoid": 34, "hardswish": [13, 34], "hardtanh": 13, "hardwar": [1, 3, 17, 25, 28, 32, 34], "hav": 17, "have": [1, 2, 5, 6, 7, 9, 14, 17, 18, 20, 21, 23, 26, 27, 28, 30, 31, 32, 33, 34], "head": [2, 34], "head_dim": 2, "head_map": 2, "head_mask": 2, "head_num": 2, "head_siz": 2, "header": 17, "heavi": 7, "heavier": 28, "height": 18, "hello": 5, "help": [2, 5, 6, 17, 23, 28, 31, 33, 34], "helper": 2, "here": [5, 8, 10, 13, 16, 17, 18, 20, 26, 32, 33, 34], "herebi": 16, "hero": 34, "heterogen": 34, "heurist": [2, 20, 34], "hf": [6, 28], "hf_beam_sampl": 34, "hf_beam_search": 34, "hf_greedy_search": 34, "hf_sampl": 34, "hidden": [2, 18, 28], "hidden_s": [2, 6], "hidden_st": 2, "high": [19, 21, 33], "higher": [2, 7, 13, 17, 18, 28], "higher_is_bett": 14, "highli": [7, 23, 28, 33, 34], "hinge_embedding_loss": 8, "hint": [2, 20], "histogram": [30, 34], "histogramobserv": [2, 15], "histori": [2, 14, 28], "hobbi": 7, "hold": [18, 33], "home": [31, 32], "homebrew": 5, "hood": 34, "hook": [10, 16], "hopefulli": 7, "host": [30, 34], "hostfil": 31, "hostnam": 31, "hotspot": 28, "how": [1, 2, 10, 15, 17, 18, 23, 28, 31, 32, 33, 34], "howev": [2, 5, 7, 8, 9, 16, 20, 26, 28, 31, 33, 34], "hp": 14, "hpc": 11, "html": [2, 5, 16], "http": [2, 5, 16, 34], "hub": 28, "huber_loss": 8, "hug": 3, "huge": [7, 14, 33], "hugginfac": 34, "huggingfac": [2, 6, 26, 28, 32, 34], "huggingface_transform": 32, "hurt": 20, "hw": 18, "hwc": 18, "hwio": 18, "hwn": 18, "hydra": 31, "hyper": [2, 30, 33, 34], "hyperparam": 14, "hyperparamet": [4, 7], "hyperparamt": 14, "hyperthread": 32, "hypertun": [4, 34], "hypertune_directori": 14, "hypervisor": 34, "hypothesi": 5, "i": [1, 
2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 21, 22, 23, 26, 27, 28, 29, 30, 32, 33, 34], "i_mpi_pin_domain": 31, "iakv": [2, 28], "ic": 2, "ic_block": 2, "id": [2, 31, 32], "idea": [11, 21, 33], "ideep": [17, 18], "ident": [2, 10, 18], "identif": [6, 17], "identifi": 34, "idx": [2, 28, 31], "ieityuan": 28, "illeg": 34, "illustr": [18, 19, 21, 31, 33], "imag": [8, 13, 18, 33, 34], "image_classifi": 32, "imagenet": [18, 30], "immedi": 7, "immintrin": 17, "impact": [2, 7, 20], "imper": [20, 34], "impl": 17, "implement": [1, 5, 7, 11, 19, 26, 28, 33, 34], "implicit": 18, "implicitli": 6, "import": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 15, 16, 17, 18, 20, 21, 23, 25, 26, 28, 29, 32, 33, 34], "impract": [7, 14], "improv": [1, 3, 7, 8, 13, 20, 22, 28, 30, 32, 33], "in1": 7, "in2": 7, "in3": 7, "in_i": 7, "in_m": 7, "inaccur": 21, "inc": [16, 17, 22, 28], "includ": [1, 2, 5, 6, 7, 10, 14, 15, 17, 23, 26, 27, 28, 30, 34], "inclus": 33, "incorrect": [12, 26, 34], "increas": [1, 2, 3, 21, 26, 28, 30, 33, 34], "independ": 31, "index": [2, 5, 18, 28, 33], "index_copi": 8, "index_to_nam": 32, "indic": [2, 6, 18, 28], "indirect": 2, "indirect_access_kv_cache_attent": [2, 34], "indirectaccesskvcacheattent": [2, 34], "individu": [5, 30], "inductor": [7, 34], "inevit": 10, "inf": 14, "infer": [2, 3, 4, 7, 10, 11, 12, 15, 18, 20, 21, 23, 26, 30, 33, 34], "inferenc": 2, "inference2": 30, "inference3": 30, "inference_mod": [6, 23, 29], "influenc": [31, 33], "info": [2, 6, 17, 26, 31, 32, 34], "inform": [1, 2, 6, 7, 14, 17, 18, 28, 31, 32, 33, 34], "ingredi": 18, "init": [2, 5, 15, 34], "init_alpha": [16, 22], "init_distribut": 29, "init_infer": 29, "init_method": 6, "init_process_group": 6, "initi": [2, 20, 32], "inject": 34, "inlin": 17, "inplac": [2, 4, 6, 13, 15, 18, 23, 32], "input": [2, 6, 7, 9, 10, 13, 15, 16, 17, 18, 22, 23, 26, 29, 30, 32, 33, 34], "input1": 10, "input_channel": 2, "input_hint": 20, "input_id": [6, 23], "input_ids_pad": 6, "input_s": [6, 23], "input_split_hint": [2, 20], "input_tokens_length": [6, 23], "inputpath": 32, "insert": [2, 16], "insid": [2, 5, 20, 31], "inspir": 34, "instal": [4, 5, 6, 23, 25, 26, 28, 33, 34], "instanc": [2, 7, 10, 14, 32, 34], "instance_idx": 31, "instancenorm": 34, "instanti": 6, "instead": [7, 8, 14, 19, 20, 29, 30, 31, 32, 33, 34], "instruct": [1, 2, 5, 6, 7, 8, 17, 21, 23, 24, 25, 28, 30, 33, 34], "int": [2, 6, 7, 14, 17, 23, 26, 29, 31, 34], "int32": 2, "int4": [2, 28, 29, 34], "int8": [1, 2, 3, 4, 17, 18, 20, 22, 28, 29, 34], "int8_qconfig": 6, "integ": [28, 31, 33], "integr": [7, 18, 28, 33, 34], "intel": [2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 20, 21, 22, 23, 25, 26, 27, 28, 29, 34], "intel discrete gpu": 1, "intel optim": 1, "intel_extension_for_pytorch": [1, 2, 4, 5, 6, 7, 10, 11, 12, 13, 14, 15, 16, 17, 20, 23, 25, 29, 32, 34], "intel_pytorch_extens": [7, 25, 26, 34], "intel\u00ae extension for pytorch*": 1, "intend": 5, "intent": 5, "interact": [7, 34], "interconnect": 33, "interest": 5, "interfac": [5, 6, 18, 26, 28], "intern": [17, 18, 20, 32], "interpret": 31, "interrupt": 32, "intervent": 8, "intra": 2, "intrins": 17, "introduc": [1, 3, 7, 15, 18, 21, 22, 31, 33, 34], "introduct": [0, 2, 7, 28, 33, 34], "invalid": 33, "invers": 8, "investig": [2, 31], "invoc": [1, 7], "invok": [2, 6, 8, 10, 13, 20, 23, 26, 29, 34], "involv": 21, "io": 28, "iostream": 6, "ip": 31, "ipex": [1, 2, 3, 4, 6, 7, 9, 11, 12, 13, 15, 16, 17, 19, 20, 23, 26, 29, 31, 32, 34], "ipex_declare_dispatch": 17, "ipex_define_dispatch": 17, "ipex_en": 32, 
"ipex_fus": 2, "ipex_register_dispatch": 17, "ipexconfig": 6, "ipexrun": [4, 10, 31, 34], "is_caus": 2, "is_contigu": 18, "is_dynam": [6, 15], "is_hyperthreading_en": 14, "is_runtime_ext_en": 2, "isa": [1, 34], "isa_codegen": 17, "isa_nam": 17, "isacodegen": 17, "issu": [1, 2, 5, 8, 21, 26, 33], "ital": 32, "item": 16, "iter": [2, 16, 21, 28, 34], "its": [2, 6, 7, 8, 14, 17, 21, 28, 30, 31, 32, 33, 34], "itself": [2, 5, 18], "ivalu": 6, "j": [2, 5, 17, 28, 30], "jan": 3, "je": 14, "jemalloc": [30, 32, 34], "jemallocl": 31, "jit": [1, 2, 5, 6, 7, 8, 13, 15, 16, 18, 20, 23, 26, 32, 34], "job": 5, "join": 33, "joint": 34, "joint_net": [26, 34], "json": [2, 6, 15, 16, 32, 34], "jul": 3, "jun": 3, "jupyt": 5, "just": [2, 14, 29, 34], "k": [2, 5], "kcpu": 17, "keep": [5, 12, 18, 21, 28, 32, 33, 34], "kei": [2, 7, 28, 34], "kept": 21, "kernel": [1, 2, 7, 20, 26, 28, 30, 33, 34], "kernel_s": 10, "key_cach": 2, "key_token": 2, "keystrok": 5, "keytensor": 2, "keyword": 2, "kill": 32, "kind": 7, "kineto_librari": 6, "kl_div": 8, "kmp": [31, 33], "kmp_": 20, "kmp_affin": [31, 32, 33], "kmp_blocktim": [31, 32, 33], "knob": [2, 4, 12, 31], "know": 5, "knowledg": 33, "known": [6, 10, 28], "kt": 3, "kv": 2, "kv_cach": [2, 28], "kwarg": [2, 29], "l1318": 2, "l1_loss": 8, "l2": 33, "l23": 2, "l4": 2, "l50": 2, "l76": 2, "label": 8, "lake": [7, 30, 34], "lamb": [19, 21], "land": [7, 34], "landscap": [1, 7, 28], "languag": [1, 2, 23, 24, 25, 26, 29, 34], "lar": 34, "larg": [1, 2, 19, 23, 24, 25, 26, 29, 30, 33, 34], "larger": [2, 20, 30, 31, 33, 34], "last": [3, 10, 21, 26, 34], "last_ind": 6, "latenc": [3, 14, 18, 28, 30, 32, 34], "later": [2, 7, 25, 33], "latest": [1, 2, 25, 28, 30, 34], "launch": [4, 6, 20, 32, 34], "launcher": [7, 13, 31, 33, 34], "law": 7, "layer": [2, 16, 20, 22, 28, 34], "layer_past": 2, "layernorm": [2, 13, 16, 22, 34], "layernorm_modul": 2, "layout": [2, 26, 34], "lazi": 5, "ld": 31, "ld_preload": [20, 31, 32, 33], "ldd": 6, "lead": 28, "leaki": 13, "leaky_relu": 13, "leakyrelu": 34, "learn": [3, 7, 8, 11, 13, 14, 21, 31, 33], "learning_r": [10, 34], "leav": [2, 20, 33], "left": [21, 28, 32], "legal": 34, "legend": 28, "len": [2, 6, 7, 13, 16, 17], "length": [2, 5, 14, 21, 26, 30, 34], "less": [2, 8, 18, 20, 26, 34], "let": [5, 10, 18, 19, 20, 21], "level": [7, 10, 13, 16, 18, 20, 21, 26, 33, 34], "leverag": [1, 7, 11, 28, 32, 34], "lib": [6, 31, 32], "lib64": [31, 32], "libc10": 6, "libdnnl_graph": 6, "libgomp": 33, "libintel": [6, 34], "libiomp": 33, "libiomp5": [20, 31, 32, 33], "libjemalloc": 31, "libpytorch_path": 6, "librari": [1, 2, 5, 6, 7, 17, 20, 32, 33, 34], "libtcmalloc": [31, 32], "libtorch": [6, 34], "libtorch_cpu": 6, "libxsmm": 2, "licens": 17, "lighter": 8, "like": [1, 2, 3, 5, 6, 7, 8, 14, 18, 19, 21, 26, 28, 31, 33, 34], "limit": [5, 8, 10, 20, 26, 32, 33, 34], "linalg_choleski": 8, "linalg_cholesky_ex": 8, "linalg_cond": 8, "linalg_eig": 8, "linalg_eigh": 8, "linalg_eigv": 8, "linalg_eigvalsh": 8, "linalg_householder_product": 8, "linalg_inv": 8, "linalg_inv_ex": 8, "linalg_lstsq": 8, "linalg_matrix_rank": 8, "linalg_qr": 8, "linalg_solv": 8, "linalg_svd": 8, "linalg_svdv": 8, "linalg_tensorinv": 8, "linalg_tensorsolv": 8, "line": [5, 10, 13, 18, 31, 32, 33], "linear": [2, 6, 7, 8, 13, 15, 16, 18, 26, 33, 34], "linear2silumul": [2, 34], "linear_": 2, "linear_bn": 2, "linear_bn_fold": 2, "linear_m": 2, "linear_m_modul": 2, "linear_modul": 2, "linear_relu_stack": 16, "linear_s_modul": 2, "linearadd": [2, 34], "linearaddadd": [2, 34], "lineargelu": [2, 34], 
"linearize_indices_and_offset": 7, "linearmul": [2, 34], "linearnewgelu": [2, 34], "linearrelu": [2, 34], "linearsilu": [2, 34], "linearsilumul": [2, 34], "link": [1, 6, 17, 34], "linux": [5, 6, 17, 30, 31, 33], "list": [2, 5, 7, 8, 13, 14, 16, 18, 25, 29, 31, 32, 33, 34], "liuhaotian": 28, "live": 5, "ll": [5, 32, 33], "llama": [2, 3, 6, 28, 34], "llama2": [30, 34], "llama3": 34, "llava": [2, 28], "llm": [1, 16, 22, 24, 25, 34], "load": [1, 2, 6, 7, 13, 15, 16, 17, 23, 29, 32, 34], "load_dataset": 6, "load_qconf_summari": 15, "load_state_dict": [2, 34], "loader": 16, "local": [6, 20, 28, 31, 32, 33], "locat": [5, 17, 34], "log": [4, 6, 13, 31, 32, 34], "logic": [2, 14, 18, 32, 33], "login": 6, "logit": 16, "long": [2, 6, 18, 21, 26, 28, 34], "long_factor": 2, "longer": [26, 30, 34], "longform": 26, "look": [5, 14, 16, 18], "loop": [5, 21, 29], "lose": 21, "loss": [2, 5, 6, 8, 16, 18, 21, 26], "loss_fn": 16, "lot": [28, 34], "low": [3, 4, 6, 7, 21, 23, 31, 33, 34], "low_cpu_mem_usag": [6, 23], "low_precision_checkpoint": [2, 6, 29], "lower": [2, 8, 17, 21, 28, 34], "lowest": 2, "lowp": [2, 6], "lowp_mod": [2, 6, 29], "lr": [6, 7, 8, 16, 19], "lr_decai": 19, "lsb": 17, "lscpu": 33, "lstm": [2, 10, 15, 34], "lstmcell": 15, "lstsq": 8, "lt": [4, 28, 30], "lu_solv": 8, "m": [4, 14, 20, 26, 31, 32, 33, 34], "m6i": [30, 32], "m7i": 30, "machin": [3, 5, 6, 7, 14, 17, 26, 31, 32, 33, 34], "maco": 5, "macro": 17, "made": [5, 34], "mai": [1, 2, 3, 5, 6, 7, 8, 9, 16, 17, 18, 20, 26, 28, 31, 32, 33, 34], "main": [1, 2, 5, 6, 14, 20, 31, 32], "mainli": [31, 34], "maintain": 8, "major": 16, "make": [2, 5, 6, 7, 14, 15, 17, 21, 23, 28, 32, 33], "make_tupl": 17, "makefil": 5, "malloc": [14, 31, 33], "malloc_conf": [31, 33], "mamx": 17, "man": [7, 33], "manag": [2, 8, 13, 20, 28, 31], "mandatori": 14, "mani": [5, 14, 28, 31, 33, 34], "manipul": 18, "mantissa": 21, "manual": [2, 7, 10, 14, 18, 20, 34], "manual_se": [6, 11], "map": [2, 6, 18, 30], "mar": [3, 32], "margin_ranking_loss": 8, "mask": [2, 7, 17, 26], "mask_valu": 17, "maskrcnn": [33, 34], "maskrnn": 34, "master": [2, 7, 21, 31], "master_addr": 6, "master_port": 6, "match": [2, 8, 17, 31], "math": 7, "matmul": [2, 8, 13, 26, 34], "matrix": [1, 6, 7, 25, 28], "matrix_rank": 8, "matur": 34, "mavx2": 17, "mavx512bf16": 17, "mavx512bw": 17, "mavx512dq": 17, "mavx512f": 17, "mavx512fp16": 17, "mavx512vl": 17, "mavx512vnni": 17, "max": [2, 6, 16, 17, 22, 23, 26, 34], "max_context_len": 2, "max_new_token": [6, 23], "max_num_blocks_per_seq": 2, "max_position_embed": 2, "max_seq": 2, "max_seq_len": 30, "max_seq_length": [10, 34], "max_seqlen_k": 2, "max_seqlen_q": 2, "max_trial": 14, "max_unpool2d": 8, "max_unpool3d": 8, "maxim": 14, "maximum": [2, 16, 17], "maxpool": 34, "maxpool2d": 13, "maycontainalia": 5, "md": 18, "me": 18, "mean": [2, 16, 17, 18, 20, 22, 28, 34], "meant": 34, "meanwhil": [12, 33, 34], "measur": [30, 34], "mechan": [1, 7, 17, 21, 34], "medium": 28, "meet": [21, 33, 34], "meltdown": 30, "membind": 33, "memori": [2, 6, 7, 8, 9, 10, 13, 19, 20, 21, 26, 28, 30, 32, 34], "memory_format": [6, 7, 18, 23], "mention": [3, 10, 20, 21, 34], "merg": [0, 7, 34], "merged_emb": 7, "merged_input": 7, "mergedembeddingbag": 7, "mergedembeddingbagwith": 7, "mergedembeddingbagwithsgd": 7, "merit": 18, "mermori": 2, "messag": [2, 6, 10, 12, 18, 31], "meta": [6, 18, 28, 29, 34], "metadata_thp": [31, 33], "method": [2, 8, 15, 16, 18, 22, 26, 33, 34], "method1": 10, "method2": 10, "methodologi": [2, 6, 7, 19, 33], "methond": 15, "metric": [2, 16, 30], 
"mfma": 17, "mha": [2, 34], "mhz": 33, "microarchitectur": 33, "microsoft": [2, 28], "might": [2, 7, 18, 26, 33, 34], "migrat": 7, "millisecond": 33, "min": [2, 16, 22, 26, 34], "mind": [18, 32], "mini": [2, 20, 28, 34], "minim": [7, 14, 17, 33], "minimum": [14, 16, 18], "minmax": 34, "minmaxobserv": [2, 6, 15], "misc": 34, "mish": 13, "miss": 5, "mistral": [2, 28, 34], "mistralai": 28, "mitig": [20, 30], "mix": [2, 6, 13, 23, 26, 28, 34], "mixed_dtyp": 34, "mixtral": [2, 28], "mixtur": [8, 34], "mkdir": 6, "mkl": 34, "mkldnn": 18, "mkldnn_util": 18, "mlp": 34, "mm": 8, "mmuzzy_decay_m": 33, "mmx": 17, "mno": 17, "mobilenet": 30, "mode": [1, 2, 5, 7, 10, 12, 18, 20, 23, 26, 32, 34], "model": [1, 2, 3, 4, 8, 9, 10, 11, 12, 14, 16, 23, 24, 25, 26, 29, 30, 33, 34], "model1": 20, "model2": 20, "model_execut": 34, "model_id": [6, 23], "model_log": 32, "model_name_or_path": [10, 29, 34], "model_script": 20, "model_service_work": 32, "model_state_dict": 6, "model_stor": 32, "model_to_be_calibr": 34, "modelfamili": 28, "modeling_llama": 2, "modelurl": 32, "modern": 3, "modifi": [2, 5, 6], "modul": [1, 6, 7, 8, 13, 16, 17, 26, 29, 31, 34], "modular": 2, "modulist": 7, "momentum": [6, 10, 21], "monkei": 10, "more": [1, 2, 5, 6, 7, 8, 10, 11, 13, 16, 17, 19, 20, 21, 23, 26, 28, 32, 33, 34], "moreov": [1, 2, 28], "mosaicml": 28, "most": [6, 7, 13, 21, 28, 30, 32, 33, 34], "motherboard": 33, "motiv": [2, 20], "move": [18, 33], "movingaverageminmax": 34, "mp_size": 29, "mpi": 31, "mpiexec": 31, "mpt": [2, 28, 34], "mrpc": 30, "mse_loss": 8, "much": [15, 18, 21, 28, 31, 33], "mul": [2, 13, 16], "multi": [2, 7, 14, 20, 28, 31, 33, 34], "multi_margin_loss": 8, "multi_stream": 2, "multi_stream_input_hint": 34, "multi_stream_model": [20, 34], "multi_stream_output_hint": 34, "multidimension": 18, "multiheadattent": 28, "multilabel_margin_loss": 8, "multilabel_margin_loss_forward": 8, "multipl": [2, 5, 7, 8, 16, 17, 18, 26, 28, 30, 32, 33, 34], "multipli": 2, "multistreammodul": [2, 7, 20, 26, 34], "multistreammodulehint": [2, 20, 34], "multithread": 33, "must": [2, 5, 14, 17, 19], "mutual": 31, "muzzy_decay_m": [31, 33], "my": 18, "mykernel": 17, "mymodel": 34, "mypi": 5, "n": [2, 6, 7, 16, 18, 19, 20, 26, 32, 33, 34], "n1": 18, "n2": 18, "n_iter": 32, "name": [2, 5, 7, 14, 17, 25, 28, 31, 32, 33, 34], "namespac": [8, 17], "nan": [17, 34], "nanquantil": 8, "narg": 6, "narrow": 5, "nativ": [1, 6, 7, 8, 17, 19, 21, 26, 28, 34], "natur": [18, 21, 28], "naver": 3, "nb": 18, "nc": 32, "nchw": [7, 33], "ncore": [10, 31], "ncore_per_inst": [14, 34], "ncores_per_inst": 14, "nd": 18, "necessari": 18, "necessarili": 2, "neck": 19, "need": [2, 5, 6, 7, 10, 13, 14, 16, 17, 18, 19, 20, 21, 23, 26, 29, 31, 32, 33, 34], "need_linearize_indices_and_offset": 7, "neelnanda": 6, "neg": 21, "neglig": 18, "neighbor": 2, "neox": [2, 28], "net": 34, "network": [1, 3, 7, 8, 20, 25, 28, 33], "neural": [1, 3, 7, 16, 22, 25, 28, 33, 34], "neuralnetwork": 16, "new": [3, 5, 12, 16, 17, 18, 20, 23, 26, 29, 33], "new_gelu": 2, "new_layer_past": 2, "newer": [1, 28, 33], "newgeluactiv": 2, "newkernel": 17, "newkernelkrnl": 17, "newli": 34, "newlin": 5, "next": [5, 7, 34], "nf4": [2, 29], "nhwc": [7, 33, 34], "nifti": 33, "ninstanc": [10, 14, 31, 34], "nint": 5, "nll_loss": 8, "nll_loss2d": 8, "nlp": [6, 7, 26, 30, 34], "nm": [7, 34], "nn": [2, 6, 7, 8, 10, 13, 15, 16, 18, 20, 26, 34], "nnc": 26, "nnode": 31, "no_grad": [4, 6, 10, 11, 12, 13, 15, 16, 20, 23, 26, 29, 32, 34], "node": [2, 20, 30, 32, 33, 34], "node0": 33, "node1": 33, 
"node_id": [2, 20, 31, 32, 34], "non": [2, 5, 8, 13, 18, 30, 32, 34], "noncontigu": 18, "none": [2, 6, 29, 31], "noqa": [6, 11, 12, 13, 16, 23, 29], "normal": [1, 2, 6, 7, 13, 20, 28, 33, 34], "normalized_shap": 2, "note": [2, 3, 5, 6, 15, 16, 17, 18, 20, 22, 24, 28, 30, 31, 32, 33], "notfound": 6, "noth": 2, "notic": [27, 31, 32], "nov": 3, "now": [2, 7, 15, 18, 32, 33, 34], "np": [16, 31], "nproc": 31, "nth": [32, 33], "num": [2, 20, 32, 33, 34], "num_attention_head": 6, "num_beam": [6, 23], "num_block": 2, "num_featur": 7, "num_head": 2, "num_hidden_lay": 6, "num_kv_head": 2, "num_nod": 14, "num_seq": 2, "num_stream": [2, 20, 34], "num_token": 2, "num_train_epoch": [10, 34], "numa": [2, 20, 31, 32, 34], "numactl": [20, 31, 32], "number": [1, 2, 5, 6, 7, 14, 16, 19, 20, 21, 26, 32, 34], "numer": [2, 8, 33], "numpi": 16, "o": [6, 17, 23, 30], "o0": [2, 26, 34], "o1": [2, 26, 34], "o3": 17, "object": [2, 6, 7, 14, 17, 20, 33, 34], "observ": [2, 9, 13, 15, 34], "obsev": 15, "obtain": 16, "obviou": 28, "occupi": 26, "occur": 34, "occurr": 28, "off": [7, 8, 21, 28, 30, 34], "offer": [1, 5, 33], "offici": [5, 32, 33, 34], "offlin": 34, "offset": [2, 18, 28], "often": 7, "old": 34, "omp": [20, 26, 31, 32, 33, 34], "omp_num_threa": 26, "omp_num_thread": [20, 26, 31, 32, 34], "omp_proc_bind": [31, 33], "omp_schedul": [31, 33], "omp_set_num_thread": 34, "onboard": [19, 33], "onc": [2, 5, 6, 14, 17, 18, 20, 21, 32, 33], "ondevic": 29, "one": [2, 5, 7, 12, 13, 14, 16, 18, 19, 20, 26, 29, 31, 33, 34], "oneapi": [6, 33], "oneccl": [3, 6, 31, 34], "oneccl_bindings_for_pytorch": 6, "onednn": [2, 3, 13, 17, 26, 28, 34], "onednn_primitive_cache_capac": 33, "onednn_verbos": 4, "ones": [2, 6, 17], "onli": [1, 2, 5, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 26, 28, 31, 32, 34], "onlyquantizationint4": 28, "onlyquantizationint8": 28, "oob": [10, 34], "op": [2, 7, 15, 16, 22, 28, 34], "op_type_dict": 2, "open": [1, 16, 28, 33], "openai": 28, "openmp": [2, 7, 20, 26, 30, 32, 34], "oper": [1, 2, 6, 8, 13, 15, 21, 32, 33, 34], "opportunit": 2, "opt": [2, 6, 17, 28], "optdecoderlay": 16, "optim": [1, 3, 4, 6, 8, 9, 11, 12, 14, 16, 18, 20, 21, 23, 25, 26, 31, 32, 33, 34], "optimize_lstm": 2, "optimize_transform": 34, "optimized_model": [2, 34], "optimized_optim": 2, "optimizer_state_dict": 6, "optimum": 10, "optin": 2, "option": [1, 2, 5, 7, 10, 14, 15, 16, 29, 31, 34], "optyp": 2, "order": [2, 17, 18, 21, 31, 33, 34], "org": [2, 7, 16, 26, 34], "organ": 18, "orgqr": 8, "origin": [2, 6, 7, 12, 13, 15, 17, 20, 29, 34], "original_max_position_embed": 2, "original_model": 2, "ormqr": 8, "other": [2, 6, 7, 8, 14, 17, 18, 19, 23, 28, 31, 33], "other_1": 2, "other_2": 2, "other_arg": 19, "otheriws": 13, "otherwis": [2, 7, 20], "our": [5, 16, 19, 28, 33, 34], "out": [2, 5, 6, 7, 8, 10, 13, 16, 19, 20, 30, 31, 33, 34], "outlier": [7, 16], "outplac": [18, 34], "output": [2, 6, 7, 8, 13, 14, 16, 18, 23, 26, 34], "output_concat_hint": [2, 20], "output_dir": [10, 14, 34], "output_hint": 20, "output_tokens_length": [6, 23], "outsid": 20, "outstand": 5, "over": [5, 7, 8, 9, 16, 18, 30, 31, 34], "overal": 33, "overflow": [26, 34], "overhead": [1, 2, 7, 10, 19, 20, 26, 28, 33, 34], "overlap": 32, "overrid": 15, "overridden": [2, 17], "oversize_threshold": [31, 33], "overview": [7, 25, 34], "overwrit": [2, 31], "own": [2, 6, 15, 28], "owner": 13, "p29": 30, "p90": 30, "pack": [2, 20, 34], "packag": [1, 2, 5, 6, 7, 10, 23, 25, 26, 32, 33, 34], "packed_weight": 2, "packed_zp": 2, "pad": [8, 10, 20, 34], "pad_max": 6, "pad_val": 
6, "padding_mod": 34, "page": [2, 6, 13, 20, 24, 29, 30, 33, 34], "pagedattent": [2, 34], "paper": [2, 34], "parallel": [2, 5, 6, 7, 28, 33, 34], "param": [2, 19, 31], "param_i": 19, "param_n": 19, "paramet": [2, 6, 7, 8, 10, 16, 17, 19, 20, 21, 26, 28, 29, 30, 31, 33, 34], "parse_arg": [6, 23], "parser": [6, 23], "part": [3, 5, 7, 8, 18, 21, 26, 31, 33, 34], "parti": 34, "partial": 7, "particular": [5, 6, 8, 29, 34], "partit": [13, 33], "pass": [1, 2, 5, 10, 17, 20, 26, 32, 34], "past": 28, "past_key_valu": [2, 6], "past_kv_length": 2, "patch": [10, 34], "path": [2, 6, 7, 14, 18, 20, 23, 31, 33, 34], "pattern": [7, 11, 18, 28, 34], "pdf": 2, "pdropout": 2, "peak": [2, 7, 11, 34], "penal": 33, "pend": 34, "per": [2, 10, 15, 16, 20, 30, 31, 32, 33, 34], "per_batch": 2, "per_batch_ic_block": 2, "per_channel_symmetr": [2, 6, 15], "per_device_train_batch_s": [10, 34], "per_ic_block": 2, "per_tensor": 2, "per_tensor_affin": [6, 15, 34], "per_tensor_symmetr": 15, "perchannelminmaxobserv": [2, 6, 15], "perf": [11, 18], "perfect": 28, "perform": [1, 2, 3, 4, 6, 7, 8, 9, 10, 13, 14, 15, 16, 18, 19, 21, 25, 28, 29, 31], "period": 33, "person": 3, "perspect": [2, 13, 18, 21, 28, 31, 33], "pertain": 17, "phase": [2, 20], "phi": [2, 28, 34], "physic": [2, 14, 20, 32, 33], "pick": 5, "piec": [2, 20], "pile": 6, "pin": [2, 20], "pinvers": 8, "pip": [4, 5, 34], "pip3": 34, "place": [2, 8, 28, 33, 34], "placeholderobserv": [6, 15], "placement": 33, "plai": [7, 33], "plan": [5, 7, 10], "platform": [3, 7, 18, 32, 33, 34], "platinum": [14, 30, 32, 33], "pleas": [2, 6, 7, 11, 16, 22, 26, 28, 29, 31, 33, 34], "plu": 33, "pmi_rank": 6, "pmi_siz": [6, 29], "point": [2, 6, 8, 15, 21, 33, 34], "pointer": 17, "poisson_nll_loss": 8, "polar": 8, "polici": 33, "polish": 34, "polymorph": 17, "pool": [2, 20, 34], "poor": [26, 34], "popular": [1, 7, 22, 28, 30, 34], "popup": 5, "port": 31, "portabl": 11, "portion": 16, "pos_embd_dim": 2, "posit": [2, 28, 33, 34], "position_id": [2, 6], "position_ids_pad": 6, "possibl": [2, 14, 15, 19, 28, 31, 33, 34], "post": [2, 4, 5, 7, 15, 28, 34], "potenti": [3, 7, 34], "pow": 13, "power": [2, 7, 33, 34], "ppn": 31, "pr": [7, 18, 34], "practic": [6, 21, 24, 28, 33], "pragma": 17, "pre": [2, 28, 34], "precis": [2, 4, 6, 13, 21, 23, 26, 30, 34], "pred": 16, "predefin": 2, "predict": 16, "prefer": [1, 7, 8, 15, 24], "prefetchw": 17, "prefetchwt1": 17, "prefil": 2, "prefix": 31, "preload": [2, 31], "prepack": [2, 6, 10, 18, 26, 28, 34], "prepar": [2, 4, 6, 13, 16, 26, 29, 32, 34], "prepared_model": [2, 4, 6, 13, 15, 16, 26, 29, 34], "prerequisit": [5, 6], "present": 32, "pretrain": [6, 32, 34], "pretti": 33, "prevent": 19, "previou": [14, 16, 18, 33, 34], "previous": 32, "primari": 33, "primarili": [8, 34], "primit": [11, 20, 30, 34], "principl": [3, 18], "print": [6, 11, 12, 13, 14, 16, 17, 23, 31], "printf": 5, "prior": [2, 23], "privat": 34, "probabl": 2, "problem": [7, 19, 26, 32, 33], "proc": 31, "procedur": 32, "process": [2, 6, 7, 11, 12, 14, 16, 19, 20, 21, 26, 31, 32, 33], "processor": [3, 7, 19, 21, 28, 30, 33, 34], "proclist": 33, "prod": 8, "produc": [5, 8], "product": [1, 2, 7, 14, 28, 34], "program": [1, 5, 7, 11, 20, 31, 33, 34], "progress": [26, 28, 34], "project": [1, 6], "prompt": [4, 6, 23, 34], "propag": [13, 21, 33], "proper": 34, "properli": 31, "properti": [6, 32], "propos": [5, 7, 11, 16, 18, 21], "prototyp": [4, 13, 20, 26, 34], "provid": [1, 2, 5, 6, 7, 8, 11, 12, 13, 14, 16, 20, 22, 24, 26, 28, 29, 31, 32, 33, 34], "pseudo": [19, 21, 34], "pseudocod": [26, 34], 
"pt": [2, 6, 13, 14, 15, 23, 32, 34], "pth": 6, "pthread": 20, "ptmalloc": 32, "ptq": 7, "public": 34, "pull": 5, "purlei": 33, "purpos": [17, 31, 32, 33], "push": 34, "push_back": 6, "put": 33, "py": [2, 5, 10, 14, 20, 31, 32, 34], "pyg": 3, "pyi": 5, "pypi": [26, 34], "python": [1, 2, 4, 10, 14, 17, 20, 26, 28, 29, 31, 32, 33, 34], "pytorch": [2, 3, 4, 6, 7, 8, 9, 10, 13, 14, 16, 17, 20, 23, 25, 26, 27, 28, 29, 30, 31, 33, 34], "q": [2, 28], "qa": [10, 34], "qconf_summari": [6, 15, 16, 29], "qconfig": [2, 4, 6, 13, 16, 26, 29, 32, 34], "qconfig_map": 6, "qconfig_summary_fil": [2, 6, 29], "qconfig_summary_file_path": 29, "qconfigmap": 6, "qint8": [2, 6, 15], "qkv": 34, "qparam": 15, "qr": 8, "qscheme": [2, 6, 15, 34], "qualiti": 34, "quant": [2, 16], "quant_stat": 15, "quantconf": 34, "quantil": 8, "quantiz": [1, 3, 4, 13, 22, 26, 28, 30, 32, 34], "quantizat": 2, "quantization_config": [2, 6, 29], "quantize_per_tensor": 26, "quantized_model": [13, 15, 34], "queri": [2, 17, 18], "query_roteri": 2, "query_token": 2, "question": [18, 30], "quick": [1, 20, 24, 25], "quick_check": 5, "quickli": 2, "quicklint": 5, "quickstart_tutori": 16, "quint8": [6, 15], "quit": [17, 34], "qwen": [2, 28, 34], "qwen2": [28, 34], "r": [5, 6, 7, 14, 23, 30, 32, 33], "rais": [2, 10], "rand": [6, 8, 12, 13, 20, 26, 34], "randint": [6, 11, 32], "randn": [2, 10, 13, 16, 18, 32, 34], "random": 14, "rang": [1, 6, 7, 15, 16, 19, 21, 26, 31, 32, 34], "rank": [6, 31, 34], "rapid": 3, "rate": 21, "rather": [2, 18], "ratio": [22, 30, 34], "raw": 2, "rc": 34, "rc3": 34, "re": [5, 8, 32, 33, 34], "reach": 34, "read": [7, 19], "readm": 34, "real": [2, 7, 14, 15, 30, 34], "realli": 5, "realtim": 30, "reason": [2, 10, 18, 20, 34], "rebas": [5, 34], "receip": [16, 20], "receipt": 20, "receiv": 21, "recent": [6, 7, 18], "recip": [2, 4, 7, 13, 15, 26, 28, 34], "recognit": 33, "recommend": [1, 5, 6, 7, 9, 10, 15, 16, 20, 23, 30, 31, 33, 34], "record": [14, 32], "recov": 21, "recurs": 5, "reduc": [1, 2, 7, 15, 19, 20, 21, 22, 26, 28, 33, 34], "reduce_rang": 15, "reduct": 34, "refer": [1, 6, 7, 9, 13, 14, 16, 17, 18, 20, 22, 23, 24, 25, 32, 34], "refin": 34, "reflection_pad1d": 8, "reflection_pad2d": 8, "regard": 13, "regardless": [8, 34], "region": [2, 8, 17, 33], "regist": [1, 7, 10, 16, 17, 34], "registr": 7, "regress": [9, 34], "regular": [6, 21], "reinstal": [5, 26], "reinterpret": 18, "reinterpret_cast": 17, "rel": [2, 4, 16, 31, 34], "relat": [2, 6, 13, 17, 31, 33, 34], "releas": [1, 17, 18, 26, 30, 33], "reli": [18, 20], "relu": [2, 7, 13, 16, 18, 26, 34], "relu6": 34, "remain": 32, "remaind": [2, 20], "remark": [26, 30, 33], "remot": 33, "remov": [2, 5, 21, 34], "reorder": [2, 18, 28], "reorder_cach": 28, "repeat": [10, 18, 21], "repeatedli": 5, "replac": [2, 5, 7, 10, 26, 34], "replace_dropout_with_ident": 2, "replication_pad1d": 8, "replication_pad2d": 8, "replication_pad3d": 8, "repo": [5, 6, 7], "repo_root": 29, "report": [1, 17], "repres": [5, 7, 21], "represent": 18, "reproduc": 32, "request": [1, 5, 20, 32], "requir": [2, 5, 6, 8, 10, 16, 18, 21, 26, 28, 29, 31, 32, 34], "research": 28, "reserv": 33, "reshape_and_cach": 2, "residu": 31, "resiz": [6, 13], "resnet18": 34, "resnet18_xpu": 34, "resnet34": [30, 34], "resnet3d": 34, "resnet50": [12, 13, 14, 18, 30, 31, 33, 34], "resnet50_weight": [6, 12, 13], "resnext": 30, "resnext101": [18, 34], "resnext3d": 34, "resolv": 34, "resourc": [13, 20, 28, 32, 33], "respect": [14, 16, 30, 31, 34], "respons": 30, "rest": 32, "restart": 32, "result": [1, 2, 6, 10, 12, 14, 16, 18, 
20, 21, 30, 31, 32, 33], "retinanet": 34, "retriev": 33, "return": [2, 6, 7, 8, 10, 16, 17, 20, 26, 34], "return_softmax": 2, "return_tensor": [6, 23], "reus": [2, 33], "review": [7, 34], "rf": 5, "rfc": 18, "rh": 17, "right": [7, 21, 23, 28], "risk": 34, "rm": 5, "rms_norm": [2, 34], "rmsnorm": [2, 28, 34], "rmsnorm_modul": 2, "rn50": [13, 34], "rn50_int8_jit": 32, "rn50_ipex_int8": 32, "rnn": 34, "rnncell": 15, "rnnt": [26, 34], "ro": 2, "roberta": [26, 34], "roialign": [7, 34], "role": 33, "root": [6, 13, 16, 17, 28], "rope": [28, 34], "rope_modul": 2, "rotari": [2, 28], "rotary_dim": 2, "rotary_embed": [2, 34], "rotary_half": 2, "rotary_ndim": 2, "rotaryembed": [2, 34], "roughli": 18, "round": [13, 21], "rounding_bia": 17, "row": 7, "rst": 5, "rule": [21, 34], "run": [2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 26, 30, 31, 32, 33, 34], "run_20210712212258_inst": 31, "run_20210712212258_instance_0_cores_0": 31, "run_20210712214504_inst": 31, "run_20210712214504_instance_0_cores_22": 31, "run_20210712220928_inst": 31, "run_20210712220928_instance_0_cores_0": 31, "run_20210712221150_inst": 31, "run_20210712221150_instance_0_cores_0": 31, "run_20210712221150_instance_1_cores_22": 31, "run_20210712221305_inst": 31, "run_20210712221305_instance_0_cores_0": 31, "run_20210712221305_instance_1_cores_11": 31, "run_20210712221305_instance_2_cores_22": 31, "run_20210712221305_instance_3_cores_33": 31, "run_20210712221415_inst": 31, "run_20210712221415_instance_0_cores_0": 31, "run_20210712221415_instance_10_cores_40": 31, "run_20210712221415_instance_1_cores_4": 31, "run_20210712221415_instance_2_cores_8": 31, "run_20210712221415_instance_3_cores_12": 31, "run_20210712221415_instance_4_cores_16": 31, "run_20210712221415_instance_5_cores_20": 31, "run_20210712221415_instance_6_cores_24": 31, "run_20210712221415_instance_7_cores_28": 31, "run_20210712221415_instance_8_cores_32": 31, "run_20210712221415_instance_9_cores_36": 31, "run_20210712221615_inst": 31, "run_20210712221615_instance_0_cores_11": 31, "run_20210712223308_inst": 31, "run_20210712223308_instance_0_cores_0": 31, "run_20210713152500_instance_0_cores_0": 31, "run_20210713153048_instance_0_cores_0": 31, "run_20210713153333_instance_0_cores_0": 31, "run_20210713153659_instance_0_cores_0": 31, "run_20220106130151_instance_0_cores_0": 31, "run_benchmark": [26, 34], "run_qa": [10, 34], "runner": 5, "running_mod": 34, "runtim": [1, 8, 13, 17, 31, 33, 34], "runtimeerror": [26, 34], "s1": 20, "s7": 34, "s8": 34, "sacrif": 8, "sai": 5, "salesforc": 28, "same": [2, 5, 7, 10, 15, 16, 17, 18, 20, 21, 28, 31, 32, 33, 34], "same_model_execution_again": 34, "sampl": [2, 6, 9, 14, 16, 17, 29, 33], "sample_input": [2, 9, 34], "sample_text_captum_input": 32, "sampler": 6, "sampling_s": [2, 4, 16, 34], "sapphir": 3, "satisfi": [15, 26], "satur": 34, "save": [2, 5, 6, 7, 13, 14, 15, 16, 18, 21, 28, 32, 34], "save_qconf_summari": [6, 15, 16, 29], "scalabl": [3, 7, 21, 28, 30, 33, 34], "scalar": 2, "scalartyp": 17, "scalartypetocpptyp": 17, "scale": [2, 3, 6, 15, 28], "scale_attn": 2, "scale_kei": 2, "scaled_dot_product_attent": 2, "scatter": 31, "scenario": [2, 6, 7, 18, 33, 34], "schedul": [1, 2, 13, 20, 31, 33], "scheme": 32, "scope": [2, 7, 8, 21, 34], "script": [1, 2, 3, 4, 5, 6, 7, 8, 10, 14, 17, 20, 23, 24, 26, 28, 29, 30, 32, 33, 34], "scriptmodul": [2, 13, 20], "sdk": 34, "search": [1, 2, 4, 5, 7, 16, 22, 28, 31], "sec": 30, "second": [2, 10, 28, 32, 33], "secondli": 28, "secret": 18, "section": [1, 6, 7, 8, 14, 20, 23, 24, 25, 28, 29, 32, 33, 34], 
"secur": 3, "see": [1, 2, 5, 8, 14, 34], "seed": 2, "seen": 28, "select": [2, 5, 7, 13, 24, 34], "self": [2, 6, 8, 10, 16, 20, 26, 34], "selu": 34, "semant": 18, "sens": 21, "sep": [3, 17], "separ": [7, 19, 27, 33], "seq_classification_artifact": 32, "seq_info": 2, "seq_len": [2, 30], "seq_length": [6, 11, 32], "seqlen_k": 2, "seqlen_q": 2, "sequenc": [2, 18, 21, 28, 34], "sequenti": 16, "seri": 33, "serv": [20, 34], "server": [32, 33], "servic": [6, 28, 30, 33], "session": 30, "set": [1, 2, 4, 5, 6, 7, 8, 14, 15, 16, 17, 21, 24, 26, 28, 30, 31, 32, 33, 34], "set_flush_denorm": 33, "set_format": 6, "set_glob": 6, "set_num_thread": [26, 34], "set_properti": 6, "sete": 15, "settensorexprfuseren": 26, "setup": [5, 6, 28, 34], "setup_config": 32, "setup_lint": 5, "sever": [2, 7, 10, 19, 30, 31, 34], "sgd": [2, 6, 7, 8, 16, 19], "sgemm": 34, "sha": 17, "shall": [5, 18, 33], "shape": [2, 6, 7, 16, 20, 23, 30, 33, 34], "shard": 28, "share": [1, 5, 6, 16, 20, 32, 33, 34], "share_weight_observ": 2, "shared_criterion": [16, 22], "sheet": 23, "shift": 21, "ship": 28, "short_factor": 2, "shortcut": 34, "shorten": 5, "shorter": [21, 28], "should": [2, 5, 8, 15, 20, 28, 31, 33], "show": [8, 17, 21, 28, 29, 30, 31, 32, 33, 34], "shown": [1, 6, 18, 28, 31, 32], "shuffl": 6, "shufflenet": 30, "shufflenetv2_x1": 30, "side": [15, 33], "sigmoid": [13, 34], "sign": 21, "signficantli": 32, "signifi": 28, "signific": 21, "significantli": [28, 34], "silu": [2, 13], "similar": [15, 17, 33], "similarli": 32, "simpl": [5, 7, 8, 11, 18, 33, 34], "simplenet": [8, 34], "simpli": [7, 26, 31], "simplifi": [10, 34], "simultan": 20, "sin": 2, "sinc": [6, 7, 18, 19, 20, 21, 26, 33, 34], "sincer": 34, "singl": [2, 7, 13, 14, 16, 19, 20, 30, 32, 34], "single_query_cached_kv_attent": 2, "site": 32, "situat": [7, 14], "six": 33, "size": [2, 6, 7, 11, 15, 16, 17, 18, 23, 26, 28, 30, 32, 33, 34], "sizeof": 17, "skip": [5, 6, 17, 18], "skip_special_token": [6, 23], "skylak": 15, "sleef": 17, "sleep": 33, "slice": [6, 18], "sliu": 34, "slope": 2, "slot": [2, 30], "slot_map": 2, "slow": 34, "slower": [8, 33, 34], "small": [7, 19, 33, 34], "smaller": [8, 17], "smooth": 7, "smooth_l1_loss": 8, "smoothquant": [2, 6, 7, 16, 22, 28, 34], "smoothquant_arg": [2, 16], "snippet": [10, 29], "so": [2, 5, 6, 7, 8, 15, 17, 18, 20, 30, 31, 32, 33, 34], "sock": 32, "socket": [14, 30, 32, 33, 34], "soft_margin_loss": 8, "softmax": [2, 13, 34], "softmax_scal": 2, "softwar": [3, 27, 34], "sole": 33, "solut": [2, 7, 26, 28, 34], "solv": [7, 19, 33], "some": [2, 5, 7, 8, 13, 16, 17, 18, 20, 26, 28, 31, 32, 33, 34], "someth": 18, "sometim": [31, 33], "sophist": 33, "sourc": [1, 5, 6, 17, 27, 28, 33, 34], "space": [2, 7, 16, 18, 22, 33], "spars": [7, 18, 34], "sparsiti": 2, "spawn": [7, 20], "special": [17, 18, 28], "specif": [1, 2, 5, 6, 7, 12, 18, 20, 26, 28, 31, 33, 34], "specifi": [2, 5, 6, 14, 20, 31, 33, 34], "specifii": 17, "spectr": 30, "speech": [3, 33], "speed": [2, 7, 11, 19, 28, 33, 34], "speedup": [2, 6, 8, 28, 30, 34], "sphinx": 5, "split": [2, 6, 7, 16, 17, 19, 20, 26, 34], "split_bf16_from_fp32": 21, "split_master_weight_for_bf16": 2, "splitsgd": [7, 21], "spontan": 18, "sqrt": [2, 13, 19], "squad": [10, 30, 34], "squar": [13, 28], "squenc": 2, "src": [2, 17], "src_data_ptr": 18, "src_md": 18, "src_mem": 18, "ssd": [30, 34], "sse": 17, "sse2": 17, "sse3": 17, "sse4_1": 17, "sse4_2": 17, "ssse3": 17, "stabil": [2, 8, 34], "stabilityai": 28, "stabl": [2, 3, 8, 34], "stablelm": [2, 28], "stack": [6, 8], "stage": [7, 10, 19, 20, 29, 33, 34], 
"stakehold": 34, "stall": 33, "standard": [1, 34], "stanford": 34, "starcod": [28, 34], "start": [1, 3, 4, 5, 6, 7, 10, 20, 24, 34], "start_dim": 20, "state": [2, 15, 19, 28], "state_dict": [2, 6, 34], "state_sum": 19, "state_sum_i": 19, "state_sum_n": 19, "statement": [14, 17], "static": [2, 4, 16, 26, 28, 31, 32, 33, 34], "static_quantized_model": 6, "staticquantizationint8": 28, "statist": 7, "statu": 17, "std": [6, 17, 19], "stdio": 5, "stdout": 31, "stead": 17, "steam": [20, 34], "step": [2, 5, 6, 7, 8, 14, 16, 19, 21, 32], "step_siz": [16, 22], "stft": 8, "stick": 7, "still": [2, 5, 7, 8, 13, 16, 18, 21, 26, 34], "stock": [13, 30, 34], "stop": [2, 33], "storag": 19, "store": [2, 17, 18, 19, 21, 28, 31, 32, 33, 34], "store_tru": [6, 23], "str": [2, 6, 14, 23, 31], "straight": [13, 33], "straightforward": 34, "strategi": [14, 31, 33, 34], "stream": [2, 7, 20, 34], "streamlin": 34, "strict": [6, 32], "stride": [8, 10, 20, 34], "stride_c": 18, "stride_h": 18, "stride_n": 18, "stride_w": 18, "string": [2, 31], "structur": [1, 18, 31, 34], "style": [2, 5], "sub": [20, 28, 33], "subfold": 17, "subgraph": 2, "subject": [7, 17, 20, 27, 34], "submit": [1, 5, 7, 20], "submodul": 5, "subsequ": [18, 33], "substr": 5, "success": [10, 24], "suffer": 20, "suffix": 17, "suggest": [1, 2, 15, 18, 20, 33, 34], "suit": 5, "sum": [13, 16, 18, 19, 34], "summar": 26, "summari": [6, 34], "super": [8, 10, 16, 20, 26, 34], "superset": 20, "suppli": 8, "support": [2, 5, 6, 7, 13, 15, 16, 17, 18, 19, 20, 21, 25, 26, 28, 29, 31, 32, 33, 34], "suppos": [2, 6, 14, 33], "sure": [5, 14, 15, 32, 33], "svd": 8, "sw": 30, "swish": 34, "switch": [7, 17, 31, 33, 34], "sy": 30, "sycl": 1, "symbol": 20, "symeig": 8, "symlink": 5, "symmetr": 15, "sync": [5, 20], "synchron": [20, 26, 34], "sysctl": 33, "system": [17, 33], "systemat": 7, "t": [2, 5, 7, 8, 14, 15, 16, 17, 18, 20, 26, 32, 34], "t5": [2, 26, 28, 34], "t_valu": 17, "tab": 5, "tabl": [2, 7, 17, 28, 30, 34], "tackl": 7, "tacotron2": 34, "take": [1, 2, 7, 8, 10, 12, 13, 14, 18, 21, 25, 26, 30, 31, 33], "taken": 32, "tanh": [13, 34], "target": [5, 6, 10, 13, 14, 17, 34], "target_link_librari": 6, "target_v": 14, "task": [2, 7, 28, 31, 33, 34], "task1": 20, "task2": 20, "taskset": 31, "tbd": 26, "tc": 14, "tcmalloc": 32, "te": 34, "team": [1, 5], "techniqu": [1, 2, 7, 11, 12, 28, 34], "technolog": [1, 7, 28], "technologi": [3, 7], "tee": 31, "tell": [18, 20, 31, 33], "temperatur": [6, 23], "tenosr": 2, "tensor": [2, 6, 7, 8, 11, 15, 16, 17, 20, 26, 28, 32, 34], "tensorexpr_fus": 26, "tensorflow": 18, "tensoriter": 18, "terabyt": 30, "term": 27, "termin": 14, "test": [7, 16, 17, 30, 34], "test_": 5, "test_alias_analysi": 5, "test_bceloss": 5, "test_data": 16, "test_dataload": 16, "test_jit": 5, "test_mseloss": 5, "test_nn": 5, "test_sequenti": 5, "testclassnam": 5, "testjit": 5, "testnam": 5, "testnn": 5, "testsuit": 5, "text": [3, 6, 26, 28, 30, 33], "text_max_length": 2, "tgi": 34, "than": [2, 5, 7, 17, 18, 20, 21, 26, 31, 33, 34], "thank": [5, 34], "thei": [2, 7, 8, 31, 33], "them": [1, 5, 7, 18, 19, 28, 31, 33], "themselv": [31, 34], "therefor": 33, "thi": [2, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 26, 27, 28, 29, 30, 31, 34], "thing": [14, 33], "third": [19, 34], "those": [2, 15, 33], "though": [2, 7], "thrash": 33, "threa": 34, "thread": [1, 2, 7, 20, 26, 30, 31, 32, 33, 34], "three": [7, 16, 17], "threshold": 33, "through": [1, 2, 6, 7, 8, 12, 25, 28, 33, 34], "throughput": [2, 3, 18, 20, 26, 28, 30, 34], "thu": [2, 7, 8, 10, 18, 20, 21, 28, 
31, 32, 33], "thudm": 28, "tidi": 5, "tightli": 34, "tiiuae": 28, "tile": 17, "time": [2, 5, 7, 14, 16, 17, 18, 19, 26, 28, 30, 33, 34], "timeout": [2, 5, 21], "timestamp": [2, 28], "tip": 17, "tmp": [10, 32, 34], "to_bfloat16_train": 7, "to_dens": 18, "to_mkldnn": 18, "togeth": [7, 14, 20, 33, 34], "toggl": 7, "token": [2, 6, 23, 28, 30], "tokenize_funct": 6, "tolist": 16, "tool": [17, 33, 34], "toolset": 17, "top": [10, 21, 34], "top1": 30, "toplevel": 5, "topologi": [7, 18, 19, 26, 30, 31, 33, 34], "torch": [1, 2, 4, 6, 8, 10, 11, 12, 13, 15, 16, 18, 20, 23, 26, 29, 32, 33, 34], "torch_ccl": 6, "torch_check": 17, "torch_dtyp": [6, 23], "torch_ipex": [17, 34], "torch_ipex_librari": 6, "torchconfig": 6, "torchdynamo": [1, 7, 12, 23, 34], "torchrun": 34, "torchscirpt": 2, "torchscript": [1, 2, 5, 7, 10, 11, 12, 19, 23, 26, 32, 34], "torchserv": [3, 34], "torchvis": [6, 10, 12, 13, 16, 18, 32, 34], "torchvison": 34, "total": [2, 6, 30, 33], "total_new_token": [6, 23], "totensor": [6, 13, 16], "tpp": [2, 28], "trace": [1, 6, 7, 8, 12, 13, 15, 16, 20, 23, 26, 32, 34], "trace_model": 34, "traced_model": [6, 10, 13, 15, 16, 26, 34], "traced_model1": 20, "traced_model2": 20, "track": 1, "track_running_stat": 10, "trade": [8, 28, 30, 34], "tradeoff": 15, "trail": [5, 21], "train": [2, 3, 4, 7, 11, 13, 15, 16, 18, 21, 23, 26, 28, 31, 34], "train_dataload": 16, "train_dataset": [6, 13], "train_load": [6, 8], "training_data": 16, "transfer": 33, "transform": [2, 3, 4, 6, 10, 11, 13, 16, 18, 22, 23, 28, 29, 32, 33, 34], "transformer_handler_gener": 32, "transformerencoderlay": 26, "transnetv2": 34, "transpar": [2, 7, 29, 33, 34], "transpos": [13, 34], "tree": [5, 6], "tri": 12, "trial": 14, "triangular_solv": 8, "trigger": 12, "triplet_margin_loss": 8, "true": [2, 4, 6, 10, 12, 13, 14, 15, 16, 17, 22, 23, 31, 32, 33, 34], "trust_remote_cod": [6, 23], "truth": 21, "try": [2, 5, 6, 7, 12, 14, 16, 26, 31, 33, 34], "tunabl": [30, 32], "tune": [2, 3, 4, 7, 8, 15, 20, 26, 28, 29, 31, 32, 34], "tuned_conf": 16, "tuned_model": [4, 16, 34], "tunin": 32, "tuning_tim": [2, 4, 16, 34], "tupl": [2, 6, 17, 20], "turboboost": 30, "turn": [7, 34], "tutori": [5, 6, 15, 16, 29, 34], "two": [2, 7, 14, 16, 20, 21, 28, 32, 33, 34], "txt": [5, 6, 32], "type": [2, 4, 5, 6, 7, 10, 16, 17, 18, 20, 21, 23, 30, 31, 32, 34], "types": 17, "typic": [6, 10, 28, 33, 34], "u": [30, 32], "u7": 34, "u8": 34, "ubuntu": 30, "ucod": 30, "uint32_t": 17, "uint4": 2, "ultra": 33, "uma": 33, "unabl": 10, "unalign": [17, 34], "uncas": [4, 6, 10, 11, 32, 34], "undefin": [20, 33], "under": [2, 6, 8, 18, 20, 27, 31, 34], "undergo": 26, "underhood": 34, "underli": [1, 17, 28], "underneath": 34, "understand": [21, 28, 33], "undesir": 31, "unexpect": 2, "unifi": [2, 31], "uniform": 32, "uninstal": 5, "union": 2, "unit": [1, 2, 33], "unittest": 5, "unix": 32, "unlik": 6, "unlist": 8, "unnecessari": 33, "unpack": [26, 34], "unpad": 2, "unpredict": 2, "unrel": 6, "unsign": 34, "unsqueez": 2, "unstabl": 8, "until": [5, 20, 21, 33], "untrack": 5, "unus": [31, 33], "unutil": 32, "up": [2, 3, 7, 11, 20, 24, 28, 33, 34], "updat": [2, 5, 7, 16, 19, 21, 22, 34], "upgrad": 34, "upi": 33, "upload": 34, "upper": [18, 33], "upsampl": [18, 34], "upstream": [7, 18, 34], "url": [32, 34], "us": [1, 2, 3, 4, 5, 6, 11, 14, 15, 17, 18, 19, 21, 23, 24, 25, 26, 27, 28, 32, 33, 34], "usabl": 34, "usag": [2, 6, 7, 8, 23, 25, 32, 33, 34], "use_all_nod": 14, "use_default_alloc": [32, 34], "use_logical_cor": [14, 32], "user": [1, 2, 7, 9, 10, 12, 13, 15, 16, 18, 20, 26, 31, 
32, 33, 34], "user_model": [6, 15], "usr": [6, 17, 31, 32], "usual": [2, 18, 20, 33], "usuali": 33, "usus": 32, "ut": 31, "util": [1, 6, 7, 10, 13, 15, 16, 18, 21, 28, 31, 33, 34], "ux": 34, "v": 5, "v0": [28, 34], "v1": [28, 34], "v2": [28, 30, 34], "v3": 34, "valid": [2, 21, 34], "valu": [2, 6, 10, 14, 16, 17, 19, 20, 21, 22, 26, 28, 31, 32, 33, 34], "value_cach": 2, "value_token": 2, "var": 29, "vari": 16, "variabl": [2, 5, 17, 30, 31, 32, 33, 34], "varianc": 34, "variance_epsilon": 2, "variant": [2, 8, 28, 34], "variou": [6, 7, 14, 28, 33, 34], "varlen_attent": [2, 34], "varlenattent": [2, 34], "varlenattention_modul": 2, "ve": 34, "vec256": 17, "vec512": 17, "vec_bia": 17, "vector": [1, 2, 6, 17, 18, 25, 28], "vectors": 17, "verbos": [2, 4, 31], "verbose_off": 2, "verbose_on": 2, "verbose_on_cr": 2, "veri": [2, 5, 15, 18, 28], "verifi": [6, 7], "version": [6, 7, 16, 17, 25, 26, 27, 32, 33, 34], "vgg": 30, "vgg11": 30, "via": [2, 5, 6, 7, 18, 20, 30, 31, 33, 34], "video": 7, "view": [13, 18, 20, 21], "view_as_complex": 8, "virtual": 17, "virtual_env": [31, 32], "vision": [3, 6, 30], "visit": [7, 33], "vllm": [2, 34], "vm": 34, "vnni": [1, 15, 17, 25, 28], "vocab_s": [6, 11, 32], "voic": 33, "void": 17, "vstack": 6, "w": [7, 16, 18, 21, 30, 32], "wa": [7, 31, 32, 33, 34], "wai": [5, 10, 16, 18, 28, 34], "wait": [20, 33], "wake": 20, "walk": 34, "want": [2, 5, 7, 14, 15, 17, 20, 31, 34], "warm": 33, "warn": [5, 6, 12, 31, 32, 34], "wav2vec2": 33, "wave2vec": 34, "wc": 18, "we": [1, 2, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 18, 19, 20, 21, 23, 28, 30, 32, 33, 34], "web": 28, "webpag": 34, "websit": 7, "wei_ic_observ": 2, "wei_observ": 2, "weight": [1, 2, 7, 10, 12, 13, 15, 16, 18, 20, 22, 23, 26, 28, 34], "weight_dacai": 21, "weight_decai": [7, 19], "weight_dtyp": [2, 6, 29], "weight_kei": 2, "weights_prepack": [2, 6, 7, 23, 26], "well": [1, 2, 5, 6, 7, 11, 16, 20, 21, 24, 28, 32, 33, 34], "were": [30, 31, 32, 33], "west": 30, "what": [3, 5, 6, 8, 23], "wheel": 34, "when": [2, 5, 6, 7, 8, 9, 14, 18, 19, 20, 21, 22, 25, 26, 28, 30, 31, 32, 33, 34], "where": [2, 5, 7, 16, 21, 33, 34], "wherea": 30, "whether": [2, 6, 8, 16, 18, 22, 23, 33], "which": [1, 2, 5, 7, 8, 10, 14, 15, 16, 17, 18, 20, 26, 28, 30, 31, 32, 33, 34], "while": [2, 7, 8, 11, 12, 18, 21, 26, 28, 31, 32, 33, 34], "whisper": [2, 28, 34], "whl": 34, "who": 10, "whole": [19, 20, 33], "wide": [21, 34], "wider": 1, "widespread": [1, 7, 28], "width": [17, 18], "wikipedia": [13, 33], "wise": [2, 16, 19, 22, 29, 34], "wish": [5, 7], "with_arg": [2, 6, 15], "within": [5, 16, 21, 29, 33, 34], "without": [2, 5, 6, 7, 8, 10, 16, 20, 21, 26, 32, 34], "wlydcrb1": 30, "wn": 18, "won": [2, 7, 8, 17, 26], "woq": [2, 28], "woqactquantmod": 2, "woqlowpmod": [2, 6, 29], "woqweightdtyp": [2, 6, 29], "work": [2, 5, 6, 7, 14, 15, 17, 20, 26, 28, 29, 31, 33, 34], "workabl": 2, "workaround": [26, 34], "worker": [20, 31], "workflow": 34, "workload": [1, 6, 7, 8, 10, 11, 12, 21, 26, 28, 29, 30, 31, 33, 34], "workload1": 30, "workspac": 6, "world": [5, 7], "world_siz": [6, 29], "worri": 32, "worth": 34, "would": [2, 5, 6, 14, 16, 17, 18, 30, 31, 32, 33, 34], "wrap": 34, "write": [7, 17], "written": [5, 6, 17], "x": [1, 2, 5, 6, 8, 10, 13, 15, 16, 17, 18, 20, 21, 23, 26, 34], "x1": 20, "x2": 20, "x86": 3, "x86_64": 30, "xcr0": 17, "xdf": 5, "xe": 33, "xeon": [3, 7, 14, 21, 28, 30, 32, 33, 34], "xl": 28, "xlm": 26, "xmx": 1, "xpu": [1, 2, 3, 34], "xsave": 17, "xx": 6, "xx_c": 34, "xx_v": 34, "y": [8, 15, 16, 20, 21, 34], "y1": 20, "y1_futur": 20, "y2": 20, 
"y2_futur": 20, "y_runtim": 20, "yaml": 14, "ye": 5, "year": 28, "yet": [2, 6, 26, 34], "yield": [1, 7, 33], "yolov3": 34, "you": [1, 2, 5, 6, 7, 8, 13, 14, 15, 17, 18, 20, 23, 25, 26, 28, 29, 31, 33, 34], "your": [1, 5, 6, 7, 8, 10, 14, 15, 20, 23, 24, 26, 27, 28, 29, 34], "your_calibration_dataset": 29, "your_conf_fil": [4, 34], "your_generation_param": 34, "your_python_script": [4, 34], "your_pytorch_script": [4, 31], "yuan": [2, 28], "yuan2": 28, "z11pa": 33, "zero": [2, 6, 15, 34], "zero_grad": [6, 7, 16], "zero_point_kei": 2, "zero_tensor": 2, "zip": [6, 23, 34], "zone": [30, 34], "zoo": 30, "\u03b1": 21}, "titles": ["Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc", "Intel\u00ae Extension for PyTorch*", "API Documentation", "Blogs & Publications", "Cheat Sheet", "Contribution", "Examples", "Features", "Auto Mixed Precision (AMP)", "Auto Channels Last", "Codeless Optimization (Prototype)", "Fast BERT (Prototype)", "Graph Capture (Prototype)", "Graph Optimization", "HyperTune (Prototype)", "Intel\u00ae Extension for PyTorch* optimizations for quantization", "INT8 Recipe Tuning API (Prototype)", "ISA Dynamic Dispatching", "Channels Last", "Optimizer Fusion", "Runtime Extension", "Split SGD", "Smooth Quant Recipe Tuning API (Prototype)", "Quick Start", "Installation", "Introduction", "Troubleshooting", "License", "Large Language Models (LLM) Optimization Overview", "LLM Optimizations Frontend API", "Performance", "Launch Script Usage Guide", "TorchServe with Intel\u00ae Extension for PyTorch*", "Performance Tuning Guide", "Releases"], "titleterms": {"": 34, "0": [6, 7, 34], "1": [7, 14, 32, 34], "10": [30, 34], "100": 34, "11": [30, 34], "12": 34, "13": [7, 34], "2": [6, 7, 14, 32, 34], "200": [30, 34], "2xlarg": 30, "3": [32, 34], "300": 34, "4": [32, 34], "8": 34, "9": 34, "That": 18, "The": 10, "__call__": 10, "access": [28, 33], "accuraci": 30, "add": 17, "ai": 30, "algorithm": 16, "all": [18, 31], "alloc": [31, 33], "alpha": [16, 34], "alreadi": 10, "amp": [7, 8], "an": 30, "api": [2, 7, 9, 13, 16, 17, 18, 22, 25, 28, 29], "appli": 10, "architectur": 1, "archiv": 32, "asynchron": 20, "aten": [17, 18], "attr": 10, "auto": [7, 8, 9, 16, 20], "autocast": 8, "autotun": 16, "aw": 30, "b": 18, "basic": 20, "behavior": 8, "benchmark": 32, "bert": [2, 6, 7, 11, 32], "beta": [6, 7], "better": 5, "bf16": [6, 10, 13, 29], "bfloat16": [6, 8, 21, 26, 30], "bind": 20, "block": 18, "blog": 3, "boost": 32, "build": [5, 17], "c": [5, 6, 18], "c6i": 30, "cach": [28, 33], "calibr": [6, 15], "can": 8, "captur": [7, 12], "case": [8, 10, 20], "center": 30, "chang": 34, "channel": [7, 9, 18, 33], "cheat": 4, "check": 17, "code": 17, "codegen": 17, "codeless": [7, 10], "command": 10, "common": 29, "compil": [7, 17], "configur": [20, 30, 33], "content": [32, 33], "contribut": 5, "convers": 18, "convert": 15, "convolut": 18, "core": [20, 31, 32], "correct": 26, "coverag": 18, "cpp": 17, "cpu": [0, 2, 17, 18, 33], "creat": [18, 32], "creation": 18, "csrc": 17, "custom": [17, 28], "d": 18, "data": [28, 30], "debug": [5, 17], "deepspe": [28, 29], "default": [8, 9, 14, 18, 31], "defin": [14, 15], "demo": 28, "denorm": 33, "deploi": [15, 32], "deploy": 6, "descent": 21, "descript": [11, 12], "design": [0, 17, 20, 31], "detail": 20, "determin": 16, "develop": 5, "disabl": 9, "dispatch": [0, 7, 17], "dispatchstub": 17, "distribut": [6, 28, 29], "do": 15, "doc": 0, "document": [2, 5, 25, 32, 33], "dure": 20, "dynam": [0, 6, 7, 15, 17, 26], "dyndisp": 17, "eager": [6, 8], "eas": [9, 13], 
"easi": 7, "ec2": 30, "elig": 8, "enabl": 9, "exampl": [6, 10, 11, 12, 14, 16, 17, 20, 31], "examples1": 20, "examples2": 20, "examples3": 20, "explicitli": 10, "export": 32, "extens": [0, 1, 5, 7, 15, 20, 26, 32], "fast": [2, 6, 7, 11], "featur": [6, 7, 11, 12, 17], "file": 32, "fix": 16, "float32": [6, 8], "fold": 13, "folder": 17, "format": 18, "forward": 10, "fp32": [6, 10, 13, 29, 30], "from": [6, 7], "frontend": 29, "fusion": [13, 19], "gener": [2, 26], "get": 25, "gnu": [31, 33], "gradient": 21, "graph": [2, 7, 12, 13, 28], "guid": [31, 33], "h": 17, "hardwar": [30, 33], "highlight": 34, "how": 20, "huggingfac": 10, "hyperparamet": 14, "hypertun": [7, 14], "i": [18, 20, 31], "ii": 31, "iii": 31, "implement": [17, 20], "improv": 34, "includ": 31, "index": 31, "indirect": 28, "infer": [6, 8, 28, 29, 31, 32], "input": [8, 20], "instal": [24, 32], "instanc": [28, 30, 31], "instead": 10, "int4": 6, "int8": [6, 7, 13, 16, 26, 30, 32], "intel": [0, 1, 5, 15, 30, 31, 32, 33], "intrin": 17, "introduct": [8, 19, 25], "iomp": 20, "ipex": [10, 28], "isa": [0, 7, 17], "issu": [9, 20, 34], "iv": 31, "jemalloc": [31, 33], "jit": 10, "kernel": [17, 18], "known": [9, 20, 34], "kv": 28, "languag": [6, 7, 28], "larg": [6, 7, 28], "last": [7, 9, 18, 33], "latenc": 31, "launch": [10, 31], "launcher": [14, 32], "layout": 18, "level": [2, 17, 28], "librari": 31, "licens": 27, "linear": 28, "lint": 5, "list": 28, "llm": [2, 6, 7, 23, 28, 29, 30], "load": 20, "local": 5, "logic": 31, "low": 28, "manner": 18, "manual": 17, "matter": 18, "memori": [18, 31, 33], "method": 10, "methodologi": [13, 28], "mix": [7, 8], "mode": [6, 28, 31], "model": [6, 7, 13, 15, 18, 20, 28, 32], "modul": [2, 10, 20, 28], "motiv": 10, "multi": 32, "multipl": 31, "multistream": 20, "nativ": 18, "nchw": 18, "nchw16c": 18, "new": [6, 7, 34], "nhwc": 18, "node": 31, "non": 33, "note": 34, "numa": 33, "numactl": 33, "number": [30, 31, 33], "omp_num_thread": 33, "omp_thread_limit": 33, "onednn": [18, 33], "onli": [6, 29], "op": 8, "openmp": [31, 33], "oper": [7, 18, 19, 28], "optim": [2, 7, 10, 13, 15, 19, 28, 29], "origin": 10, "other": 34, "output": 20, "overview": [17, 28, 30, 31, 33], "path": 8, "pattern": 13, "perform": [20, 26, 30, 32, 33, 34], "physic": 31, "pin": 32, "precis": [7, 8, 28], "preload": 20, "prepar": 15, "prerequisit": 11, "primit": [18, 33], "privat": 17, "process": 17, "product": 30, "promot": 8, "prototyp": [2, 6, 7, 10, 11, 12, 14, 16, 22, 28], "pseudocod": 29, "public": 3, "pytest": 5, "python": [5, 6, 7], "pytorch": [0, 1, 5, 15, 18, 32], "qconfig": 15, "quant": 22, "quantiz": [2, 6, 7, 15, 16, 29], "quick": 23, "recip": [16, 20, 22], "refer": 8, "regist": [18, 32], "regress": 26, "releas": 34, "requir": [17, 20], "resnet50": [6, 32], "result": [26, 34], "runtim": [2, 7, 20, 26], "scale": 32, "scenario": 29, "script": 31, "search": 14, "select": 17, "serial": 32, "serv": 32, "set": 20, "sgd": 21, "shape": 26, "sheet": 4, "singl": [28, 31], "smooth": [6, 16, 22], "smoothquant": 29, "softwar": [30, 33], "space": 14, "specif": [8, 17], "split": 21, "start": [23, 25, 32], "static": [6, 15], "statu": 18, "stochast": 21, "stride": 18, "struct": 17, "structur": [20, 33], "stub": 17, "support": [1, 8, 10], "target": 18, "task": 20, "tcmalloc": [31, 33], "tensor": 18, "test": 5, "thi": [32, 33], "through": 16, "throughput": 31, "tip": 5, "torch": 7, "torchdynamo": [6, 26], "torchscript": [6, 8], "torchserv": 32, "trace": 10, "train": [6, 8], "troubleshoot": 26, "tune": [14, 16, 22, 33], "type": [8, 28], "uniform": 33, 
"unit": 5, "us": [7, 8, 9, 10, 13, 16, 20, 31], "usag": [10, 11, 12, 14, 16, 20, 26, 29, 31], "user": 14, "v": 31, "v1": 30, "vec": 17, "verifi": 28, "version": 30, "vi": 31, "via": 28, "vii": 31, "viii": 31, "weight": [6, 29], "what": [18, 34], "widest": 8, "wip": 18, "woq": 29, "worker": 32, "write": [5, 18], "xyz": 17, "xyzkrnl": 17, "your": 31, "your_conf_fil": 14, "your_python_script": 14}}) \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/api_doc.html b/cpu/2.4.0+cpu/tutorials/api_doc.html new file mode 100644 index 000000000..4ac7e2f36 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/api_doc.html @@ -0,0 +1,1723 @@ + + + + + + + API Documentation — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

API Documentation

+
+

General

+

ipex.optimize is generally used for generic PyTorch models.

+
+
+ipex.optimize(model, dtype=None, optimizer=None, level='O1', inplace=False, conv_bn_folding=None, linear_bn_folding=None, weights_prepack=None, replace_dropout_with_identity=None, optimize_lstm=None, split_master_weight_for_bf16=None, fuse_update_step=None, auto_kernel_selection=None, sample_input=None, graph_mode=None, concat_linear=None)
+

Apply optimizations at the Python frontend to the given model (nn.Module), as well as to the given optimizer (optional). If an optimizer is given, optimizations are applied for training; otherwise, they are applied for inference. Optimizations include conv+bn folding (for inference only), weight prepacking and so on.

+

Weight prepacking is a technique to accelerate the performance of oneDNN operators. In order to achieve better vectorization and cache reuse, oneDNN uses a specific memory layout called the blocked layout. Although the calculation itself with the blocked layout is fast enough, from a memory usage perspective it has drawbacks. When running with the blocked layout, oneDNN splits one or several dimensions of data into blocks of a fixed size each time the operator is executed. More detailed information about the oneDNN memory format is available in the oneDNN manual. To reduce this overhead, data is converted to the predefined block shapes prior to oneDNN operator execution. At runtime, if the data shape matches the oneDNN operator execution requirements, oneDNN won't perform a memory layout conversion but will go directly to the calculation. Through this methodology, called weight prepacking, it is possible to avoid the runtime weight data format conversion and thus increase performance.

+
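For example, the block format of the packed weight can be matched to the real deployment shape by passing a sample input, as in the following minimal sketch (the ResNet-50 model and the 1x3x224x224 input shape are illustrative assumptions):

>>> # hypothetical sketch: feed a sample input so prepacked weights match the real input shape
+>>> import torch
+>>> import torchvision
+>>> import intel_extension_for_pytorch as ipex
+>>> model = torchvision.models.resnet50().eval()
+>>> sample = torch.randn(1, 3, 224, 224)   # assumed deployment input shape
+>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16, sample_input=sample)
+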
+
Parameters:
+
    +
  • model (torch.nn.Module) – User model to apply optimizations on.

  • +
  • dtype (torch.dtype) – Only works for torch.bfloat16 and torch.half a.k.a torch.float16. Model parameters will be cast to torch.bfloat16 or torch.half according to the dtype setting. The default value is None, meaning do nothing. Note: Data type conversion is only applied to nn.Conv2d, nn.Linear and nn.ConvTranspose2d for both training and inference cases. For inference mode, additional data type conversion is applied to the weights of nn.Embedding and nn.LSTM.

  • +
  • optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations +on, such as SGD. The default value is None, meaning inference case.

  • +
  • level (string) – "O0" or "O1". No optimizations are applied with "O0". The optimizer function just returns the original model and optimizer. With "O1", the following optimizations are applied: conv+bn folding, weights prepack, dropout removal (inference model), master weight split and fused optimizer update step (training model). The optimization options can be further overridden by setting the following options explicitly. The default value is "O1".

  • +
  • inplace (bool) – Whether to perform inplace optimization. Default value is +False.

  • +
  • conv_bn_folding (bool) – Whether to perform conv_bn folding. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
  • linear_bn_folding (bool) – Whether to perform linear_bn folding. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
  • weights_prepack (bool) – Whether to perform weight prepack for convolution +and linear to avoid oneDNN weights reorder. The default value is +None. Explicitly setting this knob overwrites the configuration +set by level knob. For now, XPU doesn’t support weights prepack.

  • +
  • replace_dropout_with_identity (bool) – Whether to replace nn.Dropout with nn.Identity. If replaced, the aten::dropout won't be included in the JIT graph. This may provide more fusion opportunities on the graph. This only works for inference models. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.

  • +
  • optimize_lstm (bool) – Whether to replace nn.LSTM with IPEX LSTM +which takes advantage of oneDNN kernels to get better performance. +The default value is None. Explicitly setting this knob +overwrites the configuration set by level knob.

  • +
  • split_master_weight_for_bf16 (bool) – Whether to split the master weights update for BF16 training. This saves memory compared to the master weight update solution. The split master weights update methodology doesn't support all optimizers. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.

  • +
  • fuse_update_step (bool) – Whether to use the fused params update for training, which has better performance. It doesn't support all optimizers. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.

  • +
  • sample_input (tuple or torch.Tensor) – Sample input data to feed to ipex.optimize. The shape of the input data will impact the block format of the packed weight. If no sample input is provided, Intel® Extension for PyTorch* packs the weight according to predefined heuristics. If a sample input with the real input shape is provided, Intel® Extension for PyTorch* can choose the best block format.

  • +
  • auto_kernel_selection (bool) – Different backends may have different performance with different dtypes/shapes. Intel® Extension for PyTorch* will try to optimize the kernel selection for better performance if this knob is set to True. You might get better performance at the cost of extra memory usage. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.

  • +
  • graph_mode (bool) [prototype] – It will automatically apply a combination of methods to generate a graph or multiple subgraphs if True. The default value is False.

  • +
  • concat_linear (bool) – Whether to perform concat_linear. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
+
+
Returns:
+

Model and optimizer (if given) modified according to the level knob or other user settings. conv+bn folding may take place and dropout may be replaced by identity. In inference scenarios, convolution, linear and lstm will be replaced with the optimized counterparts in Intel® Extension for PyTorch* (weight prepack for convolution and linear) for good performance. In bfloat16 or float16 scenarios, parameters of convolution and linear will be cast to bfloat16 or float16 dtype.

+
+
+
+

Warning

+

Please invoke the optimize function BEFORE invoking DDP in a distributed training scenario.

+

The optimize function deep-copies the original model. If DDP is invoked before the optimize function, DDP is applied to the original model rather than to the one returned from optimize. In this case, some operators in DDP, such as allreduce, will not be invoked and may thus cause unpredictable accuracy loss. A short ordering sketch is shown after the examples below.

+
+

Examples

+
>>> # bfloat16 inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16)
+>>> # running evaluation step.
+>>> # bfloat16 training case.
+>>> optimizer = ...
+>>> model.train()
+>>> optimized_model, optimized_optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+>>> # running training step.
+
+
+
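The required ordering in a distributed training scenario is sketched below; the process group initialization, the model and the optimizer are assumed to be set up elsewhere:

>>> # distributed BF16 training sketch: call ipex.optimize BEFORE wrapping with DDP
+>>> from torch.nn.parallel import DistributedDataParallel as DDP
+>>> model.train()
+>>> model, optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+>>> model = DDP(model)   # DDP sees the already-optimized (deep-copied) model
+>>> # running training step.
+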

torch.xpu.optimize() is an alternative to the optimize API in Intel® Extension for PyTorch*, providing identical usage for the XPU device only. The motivation for adding this alias is to unify the coding style in user scripts based on the torch.xpu module.

+

Examples

+
>>> # bfloat16 inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = torch.xpu.optimize(model, dtype=torch.bfloat16)
+>>> # running evaluation step.
+>>> # bfloat16 training case.
+>>> optimizer = ...
+>>> model.train()
+>>> optimized_model, optimized_optimizer = torch.xpu.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+>>> # running training step.
+
+
+
+ +

ipex.llm.optimize is used for Large Language Models (LLM).

+
+
+ipex.llm.optimize(model, optimizer=None, dtype=torch.float32, inplace=False, device='cpu', quantization_config=None, qconfig_summary_file=None, low_precision_checkpoint=None, sample_inputs=None, deployment_mode=True, cache_weight_for_large_batch=False)
+

Apply optimizations at the Python frontend to the given transformers model (nn.Module). This API focuses on transformers models, especially for generation task inference.

+

Well supported model family with full functionalities: +Llama, GPT-J, GPT-Neox, OPT, Falcon, Bloom, CodeGen, Baichuan, ChatGLM, GPTBigCode, +T5, Mistral, MPT, Mixtral, StableLM, QWen, Git, Llava, Yuan, Phi, Whisper.

+

For a model that is not in the scope of the supported model families above, this API will try to apply the default ipex.optimize transparently to get benefits (quantization is not included; it only works for dtypes of torch.bfloat16, torch.half and torch.float).

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – User model to apply optimizations.

  • +
  • optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations +on, such as SGD. The default value is None, meaning inference case.

  • +
  • dtype (torch.dtype) – Now it works for torch.bfloat16, torch.half and torch.float. +The default value is torch.float. When working with quantization, it means the mixed dtype with quantization.

  • +
  • inplace (bool) – Whether to perform inplace optimization. Default value is False.

  • +
  • device (str) – Specifying the device on which the optimization will be performed. +Can be either ‘cpu’ or ‘xpu’ (‘xpu’ is not applicable for cpu only packages). The default value is ‘cpu’.

  • +
  • quantization_config (object) – Defines the IPEX quantization recipe (weight-only quant or static quant). Default value is None. Once set, it means the IPEX quantized model is used for model.generate(). A weight-only quantization sketch is shown after the examples below.

  • +
  • qconfig_summary_file (str) – Path to the IPEX static quantization config json file. Default value is None. It works together with quantization_config in the static quantization use case. You need to run IPEX static quantization calibration to generate this file.

  • +
  • low_precision_checkpoint (dict or tuple of dict) – For weight only quantization with INT4 weights. +If it’s a dict, it should be the state_dict of checkpoint (.pt) generated by GPTQ, etc. +If a tuple is provided, it should be (checkpoint, checkpoint config), +where checkpoint is the state_dict and checkpoint config is dict specifying +keys of weight/scale/zero point/bias in the state_dict. +The default config is {‘weight_key’: ‘packed_weight’, ‘scale_key’: ‘scale’, +‘zero_point_key’: ‘packed_zp’, bias_key: ‘bias’}. Change the values of the dict to make a custom config. +Weights shape should be N by K and they are quantized to UINT4 and compressed along K, then stored as +torch.int32. Zero points are also UINT4 and stored as INT32. Scales and bias are floating point values. +Bias is optional. If bias is not in state dict, bias of the original model is used. +Default value is None.

  • +
  • sample_inputs (Tuple tensors) – sample inputs used for model quantization or torchscript. Default value is None; for well supported models, we provide the sample inputs automatically.

  • +
  • deployment_mode (bool) – Whether to apply the optimized model for deployment of model generation. It means there is no need to further apply optimization like torchscript. Default value is True.

  • +
  • cache_weight_for_large_batch (bool) – Whether to cache the dedicated weight for large batch to speed up +its inference (e.g., prefill phase) with extra memory usage. It is only valid for non-quantization cases +where dtype = bfloat16 and weight-only quantization cases where lowp-mode=BF16/INT8. In other cases, an +error will be raised. Default value is False.

  • +
+
+
Returns:
+

Optimized model object for model.generate(); it also works with model.forward

+
+
+
+

Warning

+

Please invoke the ipex.llm.optimize function AFTER invoking DeepSpeed in a Tensor Parallel inference scenario.

+
+

Examples

+
>>> # bfloat16 generation inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = ipex.llm.optimize(model, dtype=torch.bfloat16)
+>>> optimized_model.generate()
+
+
+
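As a hedged sketch only, a weight-only quantized generation case could look like the following; it assumes the ipex.quantization.get_weight_only_quant_qconfig_mapping helper together with the WoqWeightDtype and WoqLowpMode enums described in the LLM frontend documentation:

>>> # weight-only quantization (INT8 weights, BF16 compute) generation sketch
+>>> qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(weight_dtype=ipex.quantization.WoqWeightDtype.INT8, lowp_mode=ipex.quantization.WoqLowpMode.BF16)
+>>> optimized_model = ipex.llm.optimize(model, dtype=torch.bfloat16, quantization_config=qconfig)
+>>> optimized_model.generate()
+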
+ +
+
+class ipex.verbose(level)
+

On-demand oneDNN verbosing functionality

+

To make it easier to debug performance issues, oneDNN can dump verbose messages containing information like kernel size, input data size and execution duration while executing the kernel. The verbosing functionality can be invoked via an environment variable named DNNL_VERBOSE. However, this methodology dumps messages for all steps, which produces a large amount of verbose output. Moreover, for investigating performance issues, taking verbose messages for one single iteration is generally enough.

+

This on-demand verbosing functionality makes it possible to control scope +for verbose message dumping. In the following example, verbose messages +will be dumped out for the second inference only.

+
import intel_extension_for_pytorch as ipex
+model(data)
+with ipex.verbose(ipex.verbose.VERBOSE_ON):
+    model(data)
+
+
+
+
Parameters:
+

level

Verbose level

+
    +
  • VERBOSE_OFF: Disable verbosing

  • +
  • VERBOSE_ON: Enable verbosing

  • +
  • VERBOSE_ON_CREATION: Enable verbosing, including oneDNN kernel creation

  • +
+

+
+
+
+ +
+
+

LLM Module Level Optimizations (Prototype)

+

Module level optimization APIs are provided for optimizing customized LLMs.

+
+
+class ipex.llm.modules.LinearSilu(linear)
+

Applies a linear transformation to the input data, and then applies PyTorch SILU (see https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html) on the result:

+
result = torch.nn.functional.silu(linear(input))
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with silu.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearSilu(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input)
+
+
+
+ +
+
+class ipex.llm.modules.LinearSiluMul(linear)
+

Applies a linear transformation to the input data, then applies PyTorch SILU (see https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html) on the result, and multiplies that result by other:

+
result = torch.nn.functional.silu(linear(input)) * other
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear module to +be fused with silu and mul.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearSiluMul(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> other = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input, other)
+
+
+
+ +
+
+class ipex.llm.modules.Linear2SiluMul(linear_s, linear_m)
+

Applies two linear transformations to the input data (linear_s and linear_m), then applies PyTorch SILU (see https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html) on the result from linear_s, and multiplies it by the result from linear_m:

+
result = torch.nn.functional.silu(linear_s(input)) * linear_m(input)
+
+
+
+
Parameters:
+
    +
  • linear_s (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with silu.

  • +
  • linear_m (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with mul.

  • +
+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_s_module = torch.nn.Linear(4096, 4096)
+>>> linear_m_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.Linear2SiluMul(linear_s_module, linear_m_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input)
+
+
+
+ +
+
+class ipex.llm.modules.LinearRelu(linear)
+

Applies a linear transformation to the input data, and then applies PyTorch RELU (see https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html) on the result:

+
result = torch.nn.functional.relu(linear(input))
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear module +to be fused with relu.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearRelu(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input)
+
+
+
+ +
+
+class ipex.llm.modules.LinearNewGelu(linear)
+

Applies a linear transformation to the input data, and then applies NewGELUActivation (see https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py#L50) on the result:

+
result = NewGELUActivation(linear(input))
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear module +to be fused with new_gelu.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearNewGelu(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input)
+
+
+
+ +
+
+class ipex.llm.modules.LinearGelu(linear)
+

Applies a linear transformation to the input data, and then applies PyTorch GELU (see https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html) on the result:

+
result = torch.nn.functional.gelu(linear(input))
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with gelu.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearGelu(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input)
+
+
+
+ +
+
+class ipex.llm.modules.LinearMul(linear)
+

Applies a linear transformation to the input data, and then multiplies +the result by other:

+
result = linear(input) * other
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear module +to be fused with mul.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearMul(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> other = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input, other)
+
+
+
+ +
+
+class ipex.llm.modules.LinearAdd(linear)
+

Applies a linear transformation to the input data, and then adds other to the result:

+
result = linear(input) + other
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with add.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearAdd(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> other = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input, other)
+
+
+
+ +
+
+class ipex.llm.modules.LinearAddAdd(linear)
+

Applies a linear transformation to the input data, and then adds other_1 and other_2 to the result:

+
result = linear(input) + other_1 + other_2
+
+
+
+
Parameters:
+

linear (torch.nn.Linear module) – the original torch.nn.Linear +module to be fused with add and add.

+
+
+
+
Shape:

Input and output shapes are the same as torch.nn.Linear.

+
+
+

Examples

+
>>> # module init:
+>>> linear_module = torch.nn.Linear(4096, 4096)
+>>> ipex_fusion = ipex.llm.modules.LinearAddAdd(linear_module)
+>>> # module forward:
+>>> input = torch.randn(4096, 4096)
+>>> other_1 = torch.randn(4096, 4096)
+>>> other_2 = torch.randn(4096, 4096)
+>>> result = ipex_fusion(input, other_1, other_2)
+
+
+
+ +
+
+class ipex.llm.modules.RotaryEmbedding(max_position_embeddings: int, pos_embd_dim: int, base=10000, backbone: str | None = None, extra_rope_config: dict | None = None)
+

[module init and forward] Applies RotaryEmbedding (see https://huggingface.co/papers/2104.09864) +on the query or key before their multi-head attention computation.

+

module init

+
+
Parameters:
+
+
+
+

forward()

+
+
Parameters:
+
    +
  • input (torch.Tensor) – input to be applied with position embeddings, +taking shape of [batch size, sequence length, num_head/num_kv_head, head_dim] +(as well as the output shape).

  • +
  • position_ids (torch.Tensor) – the corresponding position_ids for the input. The shape should be [batch size, sequence length]. In some cases, there is only one element, which is the past_kv_length, and the position id can be constructed by past_kv_length + current_position.

  • +
  • num_head (int) – head num from the input shape.

  • +
  • head_dim (int) – head dim from the input shape.

  • +
  • offset (int) – the offset value. e.g., GPT-J 6B/ChatGLM, cos/sin is applied to the neighboring 2 elements, +so the offset is 1. For llama, cos/sin is applied to the neighboring rotary_dim elements, +so the offset is rotary_dim/2.

  • +
  • rotary_ndims (int) – the rotary dimension. e.g., 64 for GPTJ. head size for LLama.

  • +
+
+
+

Examples

+
>>> # module init:
+>>> rope_module = ipex.llm.modules.RotaryEmbedding(2048, 64, base=10000, backbone="GPTJForCausalLM")
+>>> # forward:
+>>> query = torch.randn(1, 32, 16, 256)
+>>> position_ids  = torch.arange(32).unsqueeze(0)
+>>> query_rotary = rope_module(query, position_ids, 16, 256, 1, 64)
+
+
+

[Direct function call] This module also provides a .apply_function function call to be used on query and key at the same time without initializing the module (assuming the rotary embedding sin/cos values are provided).

+

apply_function()

+
+
Parameters:
+
    +
  • query (torch.Tensor) – inputs to be applied with position embeddings, taking shape of +[batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim] (as well as the output shape).

  • +
  • key (torch.Tensor) – inputs to be applied with position embeddings, taking shape of +[batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim] (as well as the output shape).

  • +
  • sin/cos (torch.Tensor) – [num_tokens, rotary_dim] the sin/cos value tensor generated to be applied on query/key.

  • +
  • rotary_ndims (int) – the rotary dimension. e.g., 64 for GPTJ. head size for LLama.

  • +
  • head_dim (int) – head dim from the input shape.

  • +
  • rotary_half (bool) – if False (e.g., GPT-J 6B/ChatGLM), cos/sin is applied to the neighboring 2 elements, so the offset is 1; if True (e.g., for llama), cos/sin is applied to the neighboring rotary_dim elements, so the offset is rotary_dim/2.

  • +
  • position_ids (torch.Tensor) – Default is None and optional if sin/cos is provided; the corresponding position_ids for the input. The shape should be [batch size, sequence length].

  • +
+
+
Returns:
+

[batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim].

+
+
Return type:
+

query, key (torch.Tensor)

+
+
+
+ +
+
+class ipex.llm.modules.RMSNorm(hidden_size: int, eps: float = 1e-06, weight: Tensor | None = None)
+

[module init and forward] Applies RMSnorm on the input (hidden states). +(see https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L76)

+

module init

+
+
Parameters:
+
    +
  • hidden_size (int) – the size of the hidden states.

  • +
  • eps (float) – the variance_epsilon to apply RMSnorm, default using 1e-6.

  • +
  • weight (torch.Tensor) – the weight to apply RMSnorm, default None +and will use torch.ones(hidden_size).

  • +
+
+
+

forward()

+
+
Parameters:
+

hidden_states (torch.Tensor) – input to be applied RMSnorm, usually taking shape of +[batch size, sequence length, hidden_size] (as well as the output shape).

+
+
+

Examples

+
>>> # module init:
+>>> rmsnorm_module = ipex.llm.modules.RMSNorm(4096)
+>>> # forward:
+>>> input = torch.randn(1, 32, 4096)
+>>> result = rmsnorm_module(input)
+
+
+

[Direct function call] This module also provides a .apply_function function +call to apply RMSNorm without initializing the module.

+

apply_function()

+
+
Parameters:
+
    +
  • hidden_states (torch.Tensor) – the input tensor to apply RMSNorm.

  • +
  • weight (torch.Tensor) – the weight to apply RMSnorm.

  • +
  • eps (float) – the variance_epsilon to apply RMSnorm.

  • +
+
+
+
+ +
+
+class ipex.llm.modules.FastLayerNorm(normalized_shape: Tuple[int, ...], eps: float, weight: Tensor, bias: Tensor | None = None)
+

[module init and forward] Applies PyTorch Layernorm (see https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) +on the input (hidden states).

+

module init

+
+
Parameters:
+
    +
  • normalized_shape ((int or list) or torch.Size)

  • +
  • eps (float) – a value added to the denominator for numerical stability.

  • +
  • weight (torch.Tensor) – the weight of Layernorm to apply normalization.

  • +
  • bias (torch.Tensor) – an additive bias for normalization.

  • +
+
+
+

forward()

+
+
Parameters:
+

hidden_states (torch.Tensor) – input to be applied Layernorm, usually taking shape of +[batch size, sequence length, hidden_size] (as well as the output shape).

+
+
+

Examples

+
>>> # module init:
+>>> layernorm = torch.nn.LayerNorm(4096)
+>>> layernorm_module = ipex.llm.modules.FastLayerNorm(4096, eps=1e-05, weight=layernorm.weight, bias=layernorm.bias)
+>>> # forward:
+>>> input = torch.randn(1, 32, 4096)
+>>> result = layernorm_module(input)
+
+
+

[Direct function call] This module also provides a .apply_function function call to apply fast layernorm +without initializing the module.

+

apply_function()

+
+
Parameters:
+
    +
  • hidden_states (torch.Tensor) – the input tensor to apply normalization.

  • +
  • normalized_shape ((int or list) or torch.Size)

  • +
  • weight (torch.Tensor) – the weight to apply normalization.

  • +
  • bias (torch.Tensor) – an additive bias for normalization.

  • +
  • eps (float) – a value added to the denominator for numerical stability.

  • +
+
+
+
+ +
+
+class ipex.llm.modules.IndirectAccessKVCacheAttention(text_max_length=2048)
+

kv_cache is used to reduce computation for the decoder layer, but it also brings memory overheads. For example, when using beam search, the kv_cache should be reordered according to the latest beam idx, and the current key/value should also be concatenated with the kv_cache in the attention layer to get the entire context for the scale dot product. When the sequence is very long, the memory overhead becomes the performance bottleneck. This module provides an Indirect Access KV_cache (IAKV). First, IAKV pre-allocates buffers (key and value use different buffers) to store all key/value hidden states and beam index information. It can use the beam index history to decide which beam should be used at a timestamp, and this information generates an offset to access the kv_cache buffer.

+

Data Format:

+

The shape of the pre-allocated key(value) buffer is [max_seq, beam*batch, head_num, head_size], +the hidden state of key/value which is the shape of [beam*batch, head_num, head_size] is stored token by token. +All beam idx information of every timestamp is also stored in a Tensor with the shape of [max_seq, beam*batch].

+

module init

+
+
Parameters:
+

text_max_length (int) – the max length of kv cache to be used +for generation (allocate the pre-cache buffer).

+
+
+

forward()

+
+
Parameters:
+
    +
  • query (torch.Tensor) – Query tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • key (torch.Tensor) – Key tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • value (torch.Tensor) – Value tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • scale_attn (float) – scale used by the attention layer. should be sqrt(head_size).

  • +
  • layer_past (tuple(torch.Tensor)) –

    tuple(seq_info, key_cache, value_cache, beam-idx).

    +
      +
    • key_cache: key cache tensor, shape: (max_seq, beam*batch, head_num, head_dim);

    • +
    • value_cache: value cache tensor, shape: (max_seq, beam*batch, head_num, head_dim);

    • +
    • beam-idx: history beam idx, shape:(max_seq, beam*batch);

    • +
    • seq_info: Sequence info tensor, shape:(1, 1, max_seq, max_seq).

    • +
    +

  • +
  • head_mask (torch.Tensor) – Head mask tensor which is not supported by kernel yet.

  • +
  • attention_mask (torch.Tensor) – Attention mask information.

  • +
+
+
Returns:
+

Weighted value which is the output of scale dot product. +shape (beam*batch, seq_len, head_num, head_size).

+

attn_weights: The output tensor of the first matmul in scale dot product +which is not supported by kernel now.

+

new_layer_past: updated layer_past (seq_info, key_cache, value_cache, beam-idx).

+

+
+
Return type:
+

attn_output

+
+
+

Notes

+

How to reorder KV cache when using the format of IndirectAccessKVCacheAttention (e.g., on llama model +see https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1318)

+
def _reorder_cache(
+    self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor
+) -> Tuple[Tuple[torch.Tensor]]:
+    if (
+        len(past_key_values[0]) == 4 and past_key_values[0][0].shape[-1] == 1
+    ):
+        for layer_past in past_key_values:
+            layer_past[3][layer_past[0].size(-2) - 1] = beam_idx
+        return past_key_values
+
+
+

[Direct function call] This module also provides a .apply_function function call +to apply IndirectAccessKVCacheAttention without initializing the module.

+

The parameters of apply_function() are the same as the forward() call.

+
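Because the forward arguments depend on the surrounding attention implementation, only module construction is sketched here; the buffer length below is an assumed value:

>>> # module init: pre-allocate IAKV key/value buffers for up to 2048 generated tokens
+>>> iakv_attention = ipex.llm.modules.IndirectAccessKVCacheAttention(text_max_length=2048)
+>>> # forward: called inside the model's attention layer with the documented arguments
+>>> # (query, key, value, scale_attn, layer_past, head_mask, attention_mask)
+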
+ +
+
+class ipex.llm.modules.PagedAttention
+

This module follows the API of two class methods, as in vLLM (https://blog.vllm.ai/2023/06/20/vllm.html), to enable the paged attention kernel, and uses the layout of (num_blocks, block_size, num_heads, head_size) for the key/value cache. The basic logic is as follows. First, a DRAM buffer that holds num_blocks blocks is pre-allocated to store the key or value cache. Every block can store block_size tokens. In the forward pass, the cache manager first allocates some slots from this buffer and uses the reshape_and_cache API to store the key/value, and then uses the single_query_cached_kv_attention API to do the scale-dot-product of MHA. The block is the basic allocation unit of paged attention and the tokens inside a block are stored one by one. The block tables are used to map the logical blocks of a sequence to the physical blocks.

+

[class method]: reshape_and_cache
ipex.llm.modules.PagedAttention.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
This operator is used to store the key/value token states into the pre-allocated kv_cache buffers of paged attention. A shape-only usage sketch is shown after the parameter list below.

+
+
Parameters:
+
    +
  • key (torch.Tensor) – The key tensor. The shape should be [num_seqs, num_heads, head_size].

  • +
  • value (torch.Tensor) – The value tensor. The shape should be [num_seqs, num_heads, head_size].

  • +
  • key_cache (torch.Tensor) – The pre-allocated buffer to store the key cache. +The shape should be [num_blocks, block_size, num_heads, head_size].

  • +
  • value_cache (torch.Tensor) – The pre-allocated buffer to store the value cache. +The shape should be [num_blocks, block_size, num_heads, head_size].

  • +
  • slot_mapping (torch.Tensor) – It stores the positions at which to store the key/value in the pre-allocated buffers. The shape should be the number of sequences. For sequence i, slot_mapping[i] // block_number gives the block index, and slot_mapping[i] % block_size gives the offset within this block.

  • +
+
+
+
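A shape-only sketch of reshape_and_cache under assumed sizes follows; the tensor dtypes (in particular for slot_mapping) are assumptions and may need to match what the kernel expects:

>>> # hypothetical cache geometry
+>>> num_seqs, num_heads, head_size = 4, 32, 128
+>>> num_blocks, block_size = 64, 16
+>>> key = torch.randn(num_seqs, num_heads, head_size)
+>>> value = torch.randn(num_seqs, num_heads, head_size)
+>>> key_cache = torch.zeros(num_blocks, block_size, num_heads, head_size)
+>>> value_cache = torch.zeros(num_blocks, block_size, num_heads, head_size)
+>>> slot_mapping = torch.arange(num_seqs)   # one pre-allocated slot per sequence (assumed int dtype)
+>>> ipex.llm.modules.PagedAttention.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
+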

[class method]: single_query_cached_kv_attention

+
ipex.llm.modules.PagedAttention.single_query_cached_kv_attention(
+                                                    out,
+                                                    query,
+                                                    key_cache,
+                                                    value_cache,
+                                                    head_mapping,
+                                                    scale,
+                                                    block_tables,
+                                                    context_lens,
+                                                    block_size,
+                                                    max_context_len,
+                                                    alibi_slopes
+                                                    )
+
+
+

This operator is used to calculate the scale-dot-product based on paged attention. A shape-only usage sketch is shown after the parameter list below.

+
+
Parameters:
+
    +
  • out (torch.Tensor) – The output tensor with shape of [num_seqs, num_heads, head_size], where num_seqs is the number of sequences in this batch, num_heads is the number of query heads, and head_size is the head dimension.

  • +
  • query (torch.Tensor) – The query tensor. The shape should be [num_seqs, num_heads, head_size].

  • +
  • key_cache (torch.Tensor) – The pre-allocated buffer to store the key cache. +The shape should be [num_blocks, block_size, num_heads, head_size].

  • +
  • value_cache (torch.Tensor) – The pre-allocated buffer to store the value cache. +The shape should be [num_blocks, block_size, num_heads, head_size].

  • +
  • head_mapping (torch.Tensor) – The mapping from the query head to the kv head. +The shape should be the number of query heads.

  • +
  • scale (float) – The scale used by the scale-dot-product. +In general, it is: float(1.0 / (head_size ** 0.5)).

  • +
  • block_tables (torch.Tensor) – The mapping table used to map the logical sequence to the physical sequence. The shape should be [num_seqs, max_num_blocks_per_seq].

  • +
  • context_lens (torch.Tensor) – The sequence length for every sequence. The size is [num_seqs].

  • +
  • block_size (int) – The block size, i.e., the number of tokens in every block.

  • +
  • max_context_len (int) – The max sequence length.

  • +
  • alibi_slopes (torch.Tensor, optional) – the alibi slopes with the shape of (num_heads).

  • +
+
+
+
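Continuing the assumed sizes from the reshape_and_cache sketch, a shape-only call of single_query_cached_kv_attention could look like the following; the integer dtypes and passing None for alibi_slopes when ALiBi is unused are assumptions:

>>> # shape-only sketch of the paged-attention decode step
+>>> max_context_len = 1024
+>>> out = torch.empty(num_seqs, num_heads, head_size)
+>>> query = torch.randn(num_seqs, num_heads, head_size)
+>>> head_mapping = torch.arange(num_heads, dtype=torch.int32)   # MHA: query head i reads kv head i
+>>> scale = float(1.0 / (head_size ** 0.5))
+>>> block_tables = torch.zeros(num_seqs, max_context_len // block_size, dtype=torch.int32)
+>>> context_lens = torch.full((num_seqs,), block_size, dtype=torch.int32)
+>>> ipex.llm.modules.PagedAttention.single_query_cached_kv_attention(out, query, key_cache, value_cache, head_mapping, scale, block_tables, context_lens, block_size, max_context_len, None)
+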
+ +
+
+class ipex.llm.modules.VarlenAttention
+

[module init and forward] Applies PyTorch scaled_dot_product_attention on the inputs of query, key and value (see https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html), and accepts variable (different) sequence lengths among the query, key and value.

+

This module does not have args for module init.

+

forward()

+
+
Parameters:
+
    +
  • query (torch.Tensor) – shape [query_tokens, num_head, head_size], where query_tokens is the total sequence length across the batch.

  • +
  • key (torch.Tensor) – shape [key_tokens, num_head, head_size], where key_tokens is the total sequence length across the batch.

  • +
  • value (torch.Tensor) – shape [value_tokens, num_head, head_size], where value_tokens is the total sequence length across the batch.

  • +
  • out (torch.Tensor) – buffer to get the results, the shape is the same as query.

  • +
  • seqlen_q (torch.Tensor) – shape [batch_size + 1]; it points to the current query_tokens within the total sequence length.

  • +
  • seqlen_k (torch.Tensor) – shape [batch_size + 1]; it points to the current key_tokens within the total sequence length.

  • +
  • max_seqlen_q (int) – max/total sequence length of query.

  • +
  • max_seqlen_k (int) – max/total sequence length of key.

  • +
  • pdropout (float) – dropout probability; if greater than 0.0, +dropout is applied, default is 0.0.

  • +
  • softmax_scale (float) – scaling factor applied prior to softmax.

  • +
  • is_causal (bool) – whether to apply causal attention masking, default is True.

  • +
+
+
+

Examples

+
>>> # module init:
+>>> varlenAttention_module = ipex.llm.modules.VarlenAttention()
+>>> # forward:
+>>> query = torch.randn(32, 16, 256)
+>>> key = torch.randn(32, 16, 256)
+>>> value = torch.randn(32, 16, 256)
+>>> out = torch.empty_like(query)
+>>> seqlen_q = torch.tensor(1)
+>>> seqlen_k = torch.tensor(1)
+>>> max_seqlen_q = 1
+>>> max_seqlen_k  = 1
+>>> pdropout = 0.0
+>>> softmax_scale  = 0.5
+>>> varlenAttention_module(query, key, value, out, seqlen_q, seqlen_k, max_seqlen_q, max_seqlen_k, pdropout, softmax_scale)
+
+
+

[Direct function call] This module also provides a .apply_function +function call to apply VarlenAttention without initializing the module.

+

The parameters of apply_function() are the same as the forward() call.

+
+ +
+
+ipex.llm.functional.rotary_embedding(query: Tensor, key: Tensor, sin: Tensor, cos: Tensor, rotary_dim: int, rotary_half: bool, position_ids: Tensor | None = None)
+

Applies RotaryEmbedding (see https://huggingface.co/papers/2104.09864) on the query or key before their multi-head attention computation.

+
+
Parameters:
+
    +
  • query (torch.Tensor) – inputs to be applied with position embeddings, +taking shape of [batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim] (as well as the output shape).

  • +
  • key (torch.Tensor) – inputs to be applied with position embeddings, +taking shape of [batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim] (as well as the output shape).

  • +
  • sin/cos (torch.Tensor) – [num_tokens, rotary_dim] the sin/cos value tensor +generated to be applied on query/key.

  • +
  • rotary_ndims (int) – the rotary dimension. e.g., 64 for GPTJ. head size for LLama.

  • +
  • head_dim (int) – head dim from the input shape.

  • +
  • rotary_half (bool) –

    if False (e.g., GPT-J 6B/ChatGLM), cos/sin is applied to the neighboring 2 elements, so the offset is 1.

    +

    if True (e.g., for llama), cos/sin is applied to the neighboring rotary_dim elements, so the offset is rotary_dim/2.

    +

  • +
  • position_ids (torch.Tensor) – Default is None and optional if sin/cos is provided; the corresponding position_ids for the input. The shape should be [batch size, sequence length].

  • +
+
+
+
+
Return

query, key (torch.Tensor): [batch size, sequence length, num_head/num_kv_head, head_dim] +or [num_tokens, num_head/num_kv_head, head_dim].

+
+
+
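A minimal call sketch follows; the sin/cos tensors are assumed to come from the model's rotary embedding helper, and random values are used here only to show the expected shapes:

>>> # query/key: [batch, seq, num_head, head_dim]; sin/cos: [num_tokens, rotary_dim]
+>>> query = torch.randn(1, 32, 16, 256)
+>>> key = torch.randn(1, 32, 16, 256)
+>>> sin = torch.randn(32, 64)
+>>> cos = torch.randn(32, 64)
+>>> query, key = ipex.llm.functional.rotary_embedding(query, key, sin, cos, 64, True)
+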
+ +
+
+ipex.llm.functional.rms_norm(hidden_states: Tensor, weight: Tensor, eps: float)
+

Applies RMSnorm on the input (hidden states). +(see https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L76)

+
+
Parameters:
+
    +
  • hidden_states (torch.Tensor) – the input tensor to apply RMSNorm.

  • +
  • weight (torch.Tensor) – the weight to apply RMSnorm.

  • +
  • eps (float) – the variance_epsilon to apply RMSnorm.

  • +
+
+
+
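A minimal call sketch with an assumed hidden size of 4096:

>>> # RMSNorm over the last dimension
+>>> hidden_states = torch.randn(1, 32, 4096)
+>>> weight = torch.ones(4096)
+>>> result = ipex.llm.functional.rms_norm(hidden_states, weight, 1e-6)
+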
+ +
+
+ipex.llm.functional.fast_layer_norm(hidden_states: Tensor, normalized_shape: Tuple[int, ...], weight: Tensor, bias: Tensor, eps: float)
+

Applies PyTorch Layernorm (see https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) +on the input (hidden states).

+
+
Parameters:
+
    +
  • hidden_states (torch.Tensor) – the input tensor to apply normalization.

  • +
  • normalized_shape ((int or list) or torch.Size) – expected input size.

  • +
  • weight (torch.Tensor) – the weight to apply normalization.

  • +
  • bias (torch.Tensor) – an additive bias for normalization.

  • +
  • eps (float) – a value added to the denominator for numerical stability.

  • +
+
+
+
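A minimal call sketch that reuses the parameters of an existing torch.nn.LayerNorm (the 4096 hidden size is an assumed value):

>>> # LayerNorm over the last dimension using the functional fast path
+>>> layernorm = torch.nn.LayerNorm(4096)
+>>> hidden_states = torch.randn(1, 32, 4096)
+>>> result = ipex.llm.functional.fast_layer_norm(hidden_states, [4096], layernorm.weight, layernorm.bias, 1e-05)
+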
+ +
+
+ipex.llm.functional.indirect_access_kv_cache_attention(query: Tensor, key: Tensor, value: Tensor, scale_attn: float, layer_past: Tuple[Tensor] | None = None, head_mask: Tuple[Tensor] | None = None, attention_mask: Tuple[Tensor] | None = None, alibi: Tensor | None = None, add_casual_mask: bool | None = True, seq_info: Tensor | None = None, text_max_length: int | None = 0)
+

kv_cache is used to reduce computation for the decoder layer, but it also brings memory overheads. For example, when using beam search, the kv_cache should be reordered according to the latest beam idx, and the current key/value should also be concatenated with the kv_cache in the attention layer to get the entire context for the scale dot product. When the sequence is very long, the memory overhead becomes the performance bottleneck. This module provides an Indirect Access KV_cache (IAKV). First, IAKV pre-allocates buffers (key and value use different buffers) to store all key/value hidden states and beam index information. It can use the beam index history to decide which beam should be used at a timestamp, and this information generates an offset to access the kv_cache buffer.

+

Data Format:

+

The shape of the pre-allocated key(value) buffer is [max_seq, beam*batch, head_num, head_size], +the hidden state of key/value which is the shape of [beam*batch, head_num, head_size] is stored token by token. +All beam idx information of every timestamp is also stored in a Tensor with the shape of [max_seq, beam*batch].

+
+
Parameters:
+
    +
  • query (torch.Tensor) – Query tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • key (torch.Tensor) – Key tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • value (torch.Tensor) – Value tensor; shape: (beam*batch, seq_len, head_num, head_dim).

  • +
  • scale_attn (float) – scale used by the attention layer; should be sqrt(head_size).

  • +
  • layer_past (tuple(torch.Tensor)) –

    tuple(seq_info, key_cache, value_cache, beam-idx).

    +
      +
    • key_cache: key cache tensor, shape: (max_seq, beam*batch, head_num, head_dim);

    • +
    • value_cache: value cache tensor, shape: (max_seq, beam*batch, head_num, head_dim);

    • +
    • beam-idx: history beam idx, shape:(max_seq, beam*batch);

    • +
    • seq_info: Sequence info tensor, shape:(1, 1, max_seq, max_seq).

    • +
    +

  • +
  • head_mask (torch.Tensor) – Head mask tensor which is not supported by kernel yet.

  • +
  • attention_mask (torch.Tensor) – Attention mask information.

  • +
  • text_max_length (int) – the max length of kv cache to be used for generation +(allocate the pre-cache buffer).

  • +
+
+
Returns:
+

weighted value which is the output of scale dot product. +shape (beam*batch, seq_len, head_num, head_size).

+

attn_weights: the output tensor of the first matmul in scale dot product +which is not supported by kernel now.

+

new_layer_past: updated layer_past (seq_info, key_cache, value_cache, beam-idx).

+

+
+
Return type:
+

attn_output

+
+
+

Notes

+

How to reorder KV cache when using the format of IndirectAccessKVCacheAttention (e.g., on llama model +see https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1318)

+
def _reorder_cache(
+    self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor
+) -> Tuple[Tuple[torch.Tensor]]:
+    if (
+        len(past_key_values[0]) == 4 and past_key_values[0][0].shape[-1] == 1
+    ):
+        for layer_past in past_key_values:
+            layer_past[3][layer_past[0].size(-2) - 1] = beam_idx
+        return past_key_values
+
+
+
+ +
+
+ipex.llm.functional.varlen_attention(query: Tensor, key: Tensor, value: Tensor, out: Tensor, seqlen_q: Tensor, seqlen_k: Tensor, max_seqlen_q: int, max_seqlen_k: int, pdropout: float, softmax_scale: float, zero_tensors: bool, is_causal: bool, return_softmax: bool, gen_: Generator)
+

Applies PyTorch scaled_dot_product_attention on the query, key and value inputs (see https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html), and accepts variable (different) sequence lengths among the query, key and value.

+

This module does not have args for module init.

+

forward()

+
+
Parameters:
+
    +
  • query (torch.Tensor) – shape [query_tokens, num_head, head_size], where query_tokens is the total number of query tokens across the batch.

  • +
  • key (torch.Tensor) – shape [key_tokens, num_head, head_size], where key_tokens is the total number of key tokens across the batch.

  • +
  • value (torch.Tensor) – shape [value_tokens, num_head, head_size], where value_tokens is the total number of value tokens across the batch.

  • +
  • out (torch.Tensor) – buffer to get the results, the shape is the same as query.

  • +
  • seqlen_q (torch.Tensor) – shape [batch_size + 1]; points to where each sequence’s query tokens are located within the packed total (cumulative sequence lengths).

  • +
  • seqlen_k (torch.Tensor) – shape [batch_size + 1]; points to where each sequence’s key tokens are located within the packed total (cumulative sequence lengths).

  • +
  • max_seqlen_q (int) – max/total sequence length of query.

  • +
  • max_seqlen_k (int) – max/total sequence length of key.

  • +
  • pdropout (float) – dropout probability; if greater than 0.0, dropout is applied, default is 0.0.

  • +
  • softmax_scale (float) – scaling factor applied prior to softmax.

  • +
  • is_causal (bool) – whether to apply causal attention masking, default is True.

  • +
+
+
+
+ +
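A minimal sketch of packing two variable-length sequences and calling varlen_attention with the signature above. The int32 dtype of the cumulative-length tensors and passing None for gen_ (with dropout disabled) are assumptions in this sketch.

import torch
import intel_extension_for_pytorch as ipex

# Two sequences of lengths 3 and 5 packed along the token dimension.
total_tokens, num_head, head_size = 8, 16, 64
query = torch.randn(total_tokens, num_head, head_size)
key = torch.randn(total_tokens, num_head, head_size)
value = torch.randn(total_tokens, num_head, head_size)
out = torch.empty_like(query)   # results are written into this buffer

# Cumulative sequence lengths, shape [batch_size + 1] (assumed int32).
seqlen_q = torch.tensor([0, 3, 8], dtype=torch.int32)
seqlen_k = torch.tensor([0, 3, 8], dtype=torch.int32)

ipex.llm.functional.varlen_attention(
    query, key, value, out,
    seqlen_q, seqlen_k,
    5, 5,                 # max_seqlen_q, max_seqlen_k
    0.0,                  # pdropout
    head_size ** -0.5,    # softmax_scale = 1/sqrt(head_size)
    False,                # zero_tensors
    True,                 # is_causal
    False,                # return_softmax
    None,                 # gen_ (assumed to accept None when dropout is off)
)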
+
+

Fast Bert (Prototype)

+
+
+ipex.fast_bert(model, dtype=torch.float32, optimizer=None, unpad=False)
+

Use TPP to speed up training/inference. The fast_bert API is still a prototype feature and is currently only optimized for BERT models.

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – User model to apply optimizations on.

  • +
  • dtype (torch.dtype) – Only works for torch.bfloat16 and torch.float. The default value is torch.float.

  • +
  • optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations +on, such as SGD. The default value is None, meaning inference case.

  • +
  • unpad (bool) – Unpad the sequence to reduce the sparsity.

  • +
  • seed (string) – The seed used for the libxsmm kernel. In general it should be the same as torch.seed.

  • +
+
+
+
+

Note

+

Currently, the ipex.fast_bert API is well optimized for training tasks. It also works for inference tasks, though for peak inference performance please use the ipex.optimize API with TorchScript.

+
+
+

Warning

+

Please invoke fast_bert function AFTER loading weights to model via +model.load_state_dict(torch.load(PATH)).

+
+
+

Warning

+

This API can’t be used when you have applied the ipex.optimize.

+
+
+

Warning

+

Please invoke optimize function BEFORE invoking DDP in distributed +training scenario.

+
+

Examples

+
>>> # bfloat16 inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = ipex.fast_bert(model, dtype=torch.bfloat16)
+>>> # running evaluation step.
+>>> # bfloat16 training case.
+>>> optimizer = ...
+>>> model.train()
+>>> optimized_model, optimized_optimizer = ipex.fast_bert(model, dtype=torch.bfloat16,
+        optimizer=optimizer, unpad=True, seed=args.seed)
+>>> # running training step.
+
+
+
+ +
+
+

Graph Optimization

+
+
+ipex.enable_onednn_fusion(enabled)
+

Enables or disables oneDNN fusion functionality. If enabled, oneDNN operators will be fused at runtime when intel_extension_for_pytorch is imported.

+
+
Parameters:
+

enabled (bool) – Whether to enable oneDNN fusion functionality or not. +Default value is True.

+
+
+

Examples

+
>>> import intel_extension_for_pytorch as ipex
+>>> # to enable the oneDNN fusion
+>>> ipex.enable_onednn_fusion(True)
+>>> # to disable the oneDNN fusion
+>>> ipex.enable_onednn_fusion(False)
+
+
+
+ +
+
+

Quantization

+
+
+ipex.quantization.get_smooth_quant_qconfig_mapping(alpha=0.5, act_observer=None, act_ic_observer=None, wei_observer=None, wei_ic_observer=None, share_weight_observers=True)
+

Configuration with SmoothQuant for static quantization of large language models (LLMs). For SmoothQuant, see https://arxiv.org/pdf/2211.10438.pdf.

+
+
Parameters:
+
    +
  • alpha – Hyper-parameter for SmoothQuant.

  • +
  • act_observer – Observer for activation of ops other than nn.Linear. +HistogramObserver by default. For nn.Linear with SmoothQuant +enabled, q-param is calculated based on act_ic_observer’s and +wei_ic_observer’s min/max. It is not affected by this argument. +Example: torch.ao.quantization.MinMaxObserver

  • +
  • act_ic_observer – Per-input-channel Observer for activation. +For nn.Linear with SmoothQuant enabled only. +PerChannelMinMaxObserver by default. +Example: torch.ao.quantization.PerChannelMinMaxObserver.with_args(ch_axis=1)

  • +
  • wei_observer – Observer for weight of all weighted ops. +For nn.Linear with SmoothQuant enabled, it calculates q-params +after applying scaling factors. PerChannelMinMaxObserver by +default. +Example: torch.ao.quantization.PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)

  • +
  • wei_ic_observer – Per-input-channel Observer for weight. +For nn.Linear with SmoothQuant enabled only. +PerChannelMinMaxObserver by default. +Example: torch.ao.quantization.PerChannelMinMaxObserver.with_args(ch_axis=1)

  • +
+
+
Returns:
+

torch.ao.quantization.QConfig

+
+
+
+ +
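A hedged sketch of the typical prepare / calibrate / convert flow with the SmoothQuant qconfig mapping. The toy model and random calibration data below are placeholders; in practice the mapping targets the nn.Linear layers of a real LLM.

import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Placeholder standing in for an FP32 LLM; only the API flow matters here.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)).eval()
example_inputs = (torch.randn(4, 64),)

qconfig_mapping = ipex.quantization.get_smooth_quant_qconfig_mapping(alpha=0.5)
prepared = ipex.quantization.prepare(
    model, qconfig_mapping, example_inputs=example_inputs
)

with torch.no_grad():
    for _ in range(4):                    # calibration passes over representative data
        prepared(torch.randn(4, 64))

quantized = ipex.quantization.convert(prepared)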
+
+ipex.quantization.get_weight_only_quant_qconfig_mapping(*, weight_dtype: int = WoqWeightDtype.INT8, lowp_mode: int = WoqLowpMode.NONE, act_quant_mode: int = WoqActQuantMode.PER_IC_BLOCK, group_size: int = -1)
+

Configuration for weight-only quantization (WOQ) for LLM.

+
+
Parameters:
+
    +
  • weight_dtype – Data type for weight, WoqWeightDtype.INT8/INT4/NF4, etc.

  • +
  • lowp_mode

    specify the lowest precision data type for computation. Data types that have even lower precision won’t be used. Not necessarily related to activation or weight dtype.

    +
      +
    • NONE(0): Use the activation data type for computation.

    • +
    • FP16(1): Use float16 (a.k.a. half) as the lowest precision for computation.

    • +
    • BF16(2): Use bfloat16 as the lowest precision for computation.

    • +
    • INT8(3): Use INT8 as the lowest precision for computation. +Activation is quantized to int8 at runtime in this case.

    • +
    +

    Note that lowp_mode=INT8(3) is only available when weight_dtype=INT4. +In other cases, it will fall back to lowp_mode=BF16(2).

    +

  • +
  • act_quant_mode

    Quantization granularity of activation. It only works for lowp_mode=INT8. +It has no effect in other cases. The tensor is divided into groups, and +each group is quantized with its own quantization parameters. +Suppose the activation has shape batch_size by input_channel (IC).

    +
      +
    • PER_TENSOR(0): Use the same quantization parameters for the entire tensor.

    • +
    • PER_IC_BLOCK(1): Tensor is divided along IC with group size = IC_BLOCK.

    • +
    • PER_BATCH(2): Tensor is divided along batch_size with group size = 1.

    • +
    • PER_BATCH_IC_BLOCK(3): Tensor is divided into blocks of 1 x IC_BLOCK.

    • +
    +

    Note that IC_BLOCK is determined by group_size automatically.

    +

  • +
  • group_size

    Controls quantization granularity along the input channel (IC) dimension of the weight. Must be a positive power of 2 (i.e., 2^k, k > 0) or -1.

    +
    If group_size = -1:
    +    If act_quant_mode = PER_TENSOR or PER_BATCH:
    +        No grouping along IC for both activation and weight
    +    If act_quant_mode = PER_IC_BLOCK or PER_BATCH_IC_BLOCK:
    +        No grouping along IC for weight. For activation,
    +        IC_BLOCK is determined automatically by IC.
    +If group_size > 0:
    +    act_quant_mode can be any. If act_quant_mode is PER_IC_BLOCK
    +    or PER_BATCH_IC_BLOCK, weight is grouped along IC by group_size.
    +    The IC_BLOCK for activation is determined by group_size automatically.
    +    Each group has its own quantization parameters.
    +
    +
    +

  • +
+
+
+
+ +
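A brief sketch of building a weight-only quantization config. The enum locations (ipex.quantization.WoqWeightDtype / WoqLowpMode) are inferred from the defaults shown in the signature above, and passing the result to ipex.llm.optimize is the assumed downstream use.

import intel_extension_for_pytorch as ipex

# INT4 weights, INT8 as the lowest compute precision, weights grouped along IC by 128.
woq_qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4,
    lowp_mode=ipex.quantization.WoqLowpMode.INT8,
    group_size=128,
)
# Typically handed to ipex.llm.optimize(model, quantization_config=woq_qconfig, ...)
# to produce a weight-only quantized LLM.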
+
+ipex.quantization.prepare(model, configure, example_inputs=None, inplace=False, bn_folding=True, example_kwarg_inputs=None)
+

Prepare an FP32 torch.nn.Module model to do calibration or to convert to a quantized model.

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – The FP32 model to be prepared.

  • +
  • configure (torch.quantization.qconfig.QConfig) – The observer settings about activation and weight.

  • +
  • example_inputs (tuple or torch.Tensor) – A tuple of example inputs that +will be passed to the function while running to init quantization state. Only one of this +argument or example_kwarg_inputs should be specified.

  • +
  • inplace – (bool): It will change the given model in-place if True. The default value is False. +Note that if bn_folding is True, the returned model is a different object from the +original model even if inplace=True. So, with the following code +>>> prepared_model = prepare(original_model, …, inplace=True) +please use prepared_model for later operations to avoid unexpected behaviors.

  • +
  • bn_folding – (bool): whether to perform conv_bn and linear_bn folding. +The default value is True.

  • +
  • example_kwarg_inputs (dict) – A dict of example inputs that will be passed to the function while +running to init quantization state. Only one of this argument or example_inputs should be +specified.

  • +
+
+
Returns:
+

torch.nn.Module

+
+
+
+ +
+
+ipex.quantization.convert(model, inplace=False)
+

Convert an FP32 prepared model to a model which will automatically insert fake quant +before a quantizable module or operator.

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – The FP32 model to be converted.

  • +
  • inplace – (bool): It will change the given model in-place if True. The default value is False.

  • +
+
+
Returns:
+

torch.nn.Module

+
+
+
+ +
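Putting prepare and convert together, a minimal static post-training quantization sketch with a toy convolutional model; the model, data, and number of calibration passes are illustrative only.

import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()
data = torch.rand(1, 3, 32, 32)

qconfig_mapping = ipex.quantization.default_static_qconfig_mapping
prepared_model = prepare(model, qconfig_mapping, example_inputs=data, inplace=False)

with torch.no_grad():
    for _ in range(4):                      # calibration over representative batches
        prepared_model(torch.rand(1, 3, 32, 32))

converted_model = convert(prepared_model)
with torch.no_grad():
    traced_model = torch.jit.trace(converted_model, data)
    traced_model = torch.jit.freeze(traced_model)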

Prototype API; an introduction is available at the feature page.

+
+
+ipex.quantization.autotune(model, calib_dataloader, calib_func=None, eval_func=None, op_type_dict=None, smoothquant_args=None, sampling_sizes=None, accuracy_criterion=None, tuning_time=0)
+

Automatic accuracy-driven tuning helps users quickly find an advanced recipe for INT8 inference.

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – fp32 model.

  • +
  • calib_dataloader (generator) – set a dataloader for calibration.

  • +
  • calib_func (function) – calibration function for post-training static quantization. It is optional. +This function takes “model” as input parameter and executes entire inference process.

  • +
  • eval_func (function) – set an evaluation function. This function takes “model” as an input parameter, executes the entire evaluation process with self-contained metrics, and returns an accuracy value as a scalar number. The higher, the better.

  • +
  • op_type_dict (dict) – Op-type-wise tuning constraints for advanced users to reduce the tuning space. Users can specify the quantization config by op type.

  • +
  • smoothquant_args (dict) – smoothquant recipes for automatic global alpha tuning, and automatic +layer-by-layer alpha tuning for the best INT8 accuracy.

  • +
  • sampling_sizes (list) – a list of sample sizes used in calibration, where the tuning algorithm would explore from. +The default value is [100].

  • +
  • accuracy_criterion ({accuracy_criterion_type(str, 'relative' or 'absolute') – accuracy_criterion_value(float)}): +set the maximum allowed accuracy loss, either relative or absolute. The default value is {'relative': 0.01}.

  • +
  • tuning_time (seconds) – tuning timeout. The default value is 0 which means early stop.

  • +
+
+
Returns:
+

the prepared model with the tuned qconfig loaded.

+
+
Return type:
+

prepared_model (torch.nn.Module)

+
+
+
+ +
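A hedged sketch of the autotune flow described above. The toy prepared model, the synthetic calibration dataloader, and the placeholder accuracy function are all illustrative assumptions; real usage supplies a real calibration dataset and evaluation routine.

import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
data = torch.rand(1, 3, 32, 32)
qconfig_mapping = ipex.quantization.default_static_qconfig_mapping
prepared_model = prepare(model, qconfig_mapping, example_inputs=data, inplace=False)

# Synthetic (input, label) pairs standing in for a real calibration dataset.
calib_dataset = torch.utils.data.TensorDataset(
    torch.rand(16, 3, 32, 32), torch.zeros(16, dtype=torch.long)
)
calib_loader = torch.utils.data.DataLoader(calib_dataset, batch_size=4)

def evaluate(m):
    # Placeholder metric: run your validation set and return a scalar accuracy.
    return 1.0

tuned_model = ipex.quantization.autotune(
    prepared_model,
    calib_loader,
    eval_func=evaluate,
    sampling_sizes=[16],
    accuracy_criterion={"relative": 0.01},
    tuning_time=0,
)
converted_model = convert(tuned_model)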
+
+

CPU Runtime

+
+
+ipex.cpu.runtime.is_runtime_ext_enabled()
+

Helper function to check whether runtime extension is enabled or not.

+
+
Parameters:
+

None (None) – None

+
+
Returns:
+

+
+Whether the runtime extension is enabled or not. If the

Intel OpenMP Library is preloaded, this API will return True. +Otherwise, it will return False.

+
+
+

+
+
Return type:
+

bool

+
+
+
+ +
+
+class ipex.cpu.runtime.CPUPool(core_ids: list | None = None, node_id: int | None = None)
+

An abstraction of a pool of CPU cores used for intra-op parallelism.

+
+
Parameters:
+
    +
  • core_ids (list) – A list of CPU cores’ ids used for intra-op parallelism.

  • +
  • node_id (int) – A NUMA node id; all CPU cores on that NUMA node are used. node_id does not take effect if core_ids is set.

  • +
+
+
Returns:
+

Generated +ipex.cpu.runtime.CPUPool object.

+
+
Return type:
+

ipex.cpu.runtime.CPUPool

+
+
+
+ +
+
+class ipex.cpu.runtime.pin(cpu_pool: CPUPool)
+

Apply the given CPU pool to the master thread that runs the scoped code region or the decorated function/method.

+
+
Parameters:
+

cpu_pool (ipex.cpu.runtime.CPUPool) – ipex.cpu.runtime.CPUPool object, contains +all CPU cores used by the designated operations.

+
+
Returns:
+

Generated +ipex.cpu.runtime.pin object which can be used +as a with context or a function decorator.

+
+
Return type:
+

ipex.cpu.runtime.pin

+
+
+
+ +
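A short sketch combining CPUPool and pin; the core ids below are machine-specific assumptions, and CPUPool(node_id=0) is an equivalent alternative.

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(16, 16).eval()
x = torch.randn(1, 16)

# Core ids are machine-specific; adjust to cores that exist on your system.
cpu_pool = ipex.cpu.runtime.CPUPool(core_ids=[0, 1, 2, 3])

with torch.no_grad(), ipex.cpu.runtime.pin(cpu_pool):
    y = model(x)      # the scoped region runs on the designated cores

# pin can also be used as a decorator on a function that runs the workload.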
+
+class ipex.cpu.runtime.MultiStreamModuleHint(*args, **kwargs)
+

MultiStreamModuleHint is a hint to MultiStreamModule about how to split the inputs or concat the outputs. Each argument should be None, an int, or a container of ints and Nones, such as (0, None, …) or [0, None, …]. If an argument is None, it will not be split or concatenated. If an argument is an int, its value is the dim along which that argument will be split or concatenated.

+
+
Parameters:
+
    +
  • *args – Variable length argument list.

  • +
  • **kwargs – Arbitrary keyword arguments.

  • +
+
+
Returns:
+

Generated +ipex.cpu.runtime.MultiStreamModuleHint object.

+
+
Return type:
+

ipex.cpu.runtime.MultiStreamModuleHint

+
+
+
+ +
+
+class ipex.cpu.runtime.MultiStreamModule(model, num_streams: int | str = 'AUTO', cpu_pool: ~ipex.cpu.runtime.cpupool.CPUPool = <ipex.cpu.runtime.cpupool.CPUPool object>, concat_output: bool = True, input_split_hint: ~ipex.cpu.runtime.multi_stream.MultiStreamModuleHint = <ipex.cpu.runtime.multi_stream.MultiStreamModuleHint object>, output_concat_hint: ~ipex.cpu.runtime.multi_stream.MultiStreamModuleHint = <ipex.cpu.runtime.multi_stream.MultiStreamModuleHint object>)
+

MultiStreamModule supports inference with multi-stream throughput mode.

+

If the number of cores inside cpu_pool is divisible by num_streams, the cores will be allocated equally to each stream. If it is not divisible, with remainder N, one extra core will be allocated to each of the first N streams. We suggest setting num_streams to a divisor of the number of cores inside cpu_pool.

+

If the inputs’ batch size is larger than and divisible by num_streams, the batch will be split equally across the streams. If it is not divisible, with remainder N, one extra piece will be allocated to each of the first N streams. If the inputs’ batch size is smaller than num_streams, only the first batch-size streams are used, each with a mini batch of one. We suggest setting the inputs’ batch size larger than and divisible by num_streams. If you don’t want to tune the number of streams and leave it as “AUTO”, we suggest setting the inputs’ batch size larger than and divisible by the number of cores.

+
+
Parameters:
+
    +
  • model (torch.jit.ScriptModule or torch.nn.Module) – The input model.

  • +
  • num_streams (Union[int, str]) – Number of instances (int) or “AUTO” (str). “AUTO” means the stream number will be selected automatically. Although “AUTO” usually provides reasonable performance, it may still not be optimal for some cases, in which case manual tuning of the number of streams is needed.

  • +
  • cpu_pool (ipex.cpu.runtime.CPUPool) – An +ipex.cpu.runtime.CPUPool object, contains +all CPU cores used to run multi-stream inference.

  • +
  • concat_output (bool) – A flag indicating whether the output of each stream will be concatenated or not. The default value is True. Note: if the outputs of the streams can’t be concatenated, set this flag to False to get the raw output (a list of each stream’s output).

  • +
  • input_split_hint (MultiStreamModuleHint) – Hint to MultiStreamModule about +how to split the inputs.

  • +
  • output_concat_hint (MultiStreamModuleHint) – Hint to MultiStreamModule about +how to concat the outputs.

  • +
+
+
Returns:
+

Generated +ipex.cpu.runtime.MultiStreamModule object.

+
+
Return type:
+

ipex.cpu.runtime.MultiStreamModule

+
+
+
+ +
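A minimal multi-stream throughput sketch; the toy model, batch size, and stream count are illustrative, and the default hints split the input / concat the output along dim 0.

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(16, 16).eval()
traced = torch.jit.freeze(torch.jit.trace(model, torch.randn(4, 16)))

cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
multi_stream_model = ipex.cpu.runtime.MultiStreamModule(
    traced,
    num_streams=2,        # ideally a divisor of the core count in cpu_pool
    cpu_pool=cpu_pool,    # default hints split inputs / concat outputs along dim 0
)

x = torch.randn(4, 16)    # batch size divisible by num_streams
with torch.no_grad():
    y = multi_stream_model(x)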
+
+class ipex.cpu.runtime.Task(module, cpu_pool: CPUPool)
+

An abstraction of a computation based on a PyTorch module that is scheduled asynchronously.

+
+
Parameters:
+
    +
  • model (torch.jit.ScriptModule or torch.nn.Module) – The input module.

  • +
  • cpu_pool (ipex.cpu.runtime.CPUPool) – An +ipex.cpu.runtime.CPUPool object, contains +all CPU cores used to run Task asynchronously.

  • +
+
+
Returns:
+

Generated +ipex.cpu.runtime.Task object.

+
+
Return type:
+

ipex.cpu.runtime.Task

+
+
+
+ +
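A minimal asynchronous Task sketch; the future-style result object with a get() method follows the runtime extension feature documentation, and the core ids are machine-specific assumptions.

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(16, 16).eval()
traced = torch.jit.freeze(torch.jit.trace(model, torch.randn(1, 16)))

pool = ipex.cpu.runtime.CPUPool(core_ids=[0, 1])
task = ipex.cpu.runtime.Task(traced, pool)

x = torch.randn(1, 16)
future = task(x)      # submitted asynchronously on the given CPU pool
y = future.get()      # block until the result is ready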
+
+ipex.cpu.runtime.get_core_list_of_node_id(node_id)
+

Helper function to get the CPU cores’ ids of the input numa node.

+
+
Parameters:
+

node_id (int) – Input numa node id.

+
+
Returns:
+

List of CPU cores’ ids on this numa node.

+
+
Return type:
+

list

+
+
+
+ +
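The helper functions is_runtime_ext_enabled and get_core_list_of_node_id can be exercised directly; note that the former returns True only when the Intel OpenMP library has been preloaded (e.g. via LD_PRELOAD).

import intel_extension_for_pytorch as ipex

# True only if the Intel OpenMP library is preloaded in the current process.
print(ipex.cpu.runtime.is_runtime_ext_enabled())

# List of CPU core ids that belong to NUMA node 0.
print(ipex.cpu.runtime.get_core_list_of_node_id(0))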
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/blogs_publications.html b/cpu/2.4.0+cpu/tutorials/blogs_publications.html new file mode 100644 index 000000000..f6c433cd1 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/blogs_publications.html @@ -0,0 +1,188 @@ + + + + + + + Blogs & Publications — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Blogs & Publications

+ +
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/cheat_sheet.html b/cpu/2.4.0+cpu/tutorials/cheat_sheet.html new file mode 100644 index 000000000..e113cf597 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/cheat_sheet.html @@ -0,0 +1,216 @@ + + + + + + + Cheat Sheet — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Cheat Sheet

+

Get started with Intel® Extension for PyTorch* using the following commands:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DescriptionCommand
Basic CPU Installationpython -m pip install intel_extension_for_pytorch
Import Intel® Extension for PyTorch*import intel_extension_for_pytorch as ipex
Capture a Verbose Log (Command Prompt)export ONEDNN_VERBOSE=1
Optimization During Trainingmodel = ...
optimizer = ...
model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer)
Optimization During Inferencemodel = ...
model.eval()
model = ipex.optimize(model)
Optimization Using the Low-Precision Data Type bfloat16
During Training (Default FP32)
model = ...
optimizer = ...
model.train()

model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

with torch.no_grad():
with torch.cpu.amp.autocast():
model(data)
Optimization Using the Low-Precision Data Type bfloat16
During Inference (Default FP32)
model = ...
model.eval()

model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.cpu.amp.autocast():
model(data)
[Prototype] Fast BERT Optimizationfrom transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

model = ipex.fast_bert(model, dtype=torch.bfloat16)
Run CPU Launch Script (Command Prompt):
Automate Configuration Settings for Performance
ipexrun [knobs] <your_pytorch_script> [args]
[Prototype] Run HyperTune to perform hyperparameter/execution configuration searchpython -m intel_extension_for_pytorch.cpu.hypertune --conf-file <your_conf_file> <your_python_script> [args]
[Prototype] Enable Graph capturemodel = …
model.eval()
model = ipex.optimize(model, graph_mode=True)
Post-Training INT8 Quantization (Static)model = …
model.eval()
data = …

qconfig = ipex.quantization.default_static_qconfig

prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data, inplace=False)

for d in calibration_data_loader():
prepared_model(d)

converted_model = ipex.quantization.convert(prepared_model)
Post-Training INT8 Quantization (Dynamic)model = …
model.eval()
data = …

qconfig = ipex.quantization.default_dynamic_qconfig

prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data)

converted_model = ipex.quantization.convert(prepared_model)
[Prototype] Post-Training INT8 Quantization (Tuning Recipe):model = …
model.eval()
data = …

qconfig = ipex.quantization.default_static_qconfig

prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=data, inplace=False)

tuned_model = ipex.quantization.autotune(prepared_model, calibration_data_loader, eval_function, sampling_sizes=[100],
accuracy_criterion={'relative': .01}, tuning_time=0)

convert_model = ipex.quantization.convert(tuned_model)
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/contribution.html b/cpu/2.4.0+cpu/tutorials/contribution.html new file mode 100644 index 000000000..d24bcdaf8 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/contribution.html @@ -0,0 +1,352 @@ + + + + + + + Contribution — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Contribution

+
+

Contributing to Intel® Extension for PyTorch*

+

Thank you for your interest in contributing to Intel® Extension for PyTorch*. Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution:

+
    +
  1. You want to propose a new feature and implement it.

    +
      +
    • Post about your intended feature in a GitHub issue, and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it.

    • +
    +
  2. +
  3. You want to implement a feature or bug-fix for an outstanding issue.

    +
      +
    • Search for your issue in the GitHub issue list.

    • +
    • Pick an issue and comment that you’d like to work on the feature or bug-fix.

    • +
    • If you need more context on a particular issue, ask and we shall provide.

    • +
    +
  4. +
+

Once you implement and test your feature or bug-fix, submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch.

+
+
+

Developing Intel® Extension for PyTorch*

+

A full set of instructions on installing Intel® Extension for PyTorch* from source is in the Installation document.

+

To develop on your machine, here are some tips:

+
    +
  1. Uninstall all existing Intel® Extension for PyTorch* installs. You may need to run pip uninstall intel_extension_for_pytorch multiple times. You’ll know intel_extension_for_pytorch is fully uninstalled when you see WARNING: Skipping intel_extension_for_pytorch as it is not installed. (You should only have to pip uninstall a few times, but you can always uninstall with timeout or in a loop if you’re feeling lazy.)

    +
    yes | pip uninstall intel_extension_for_pytorch
    +
    +
    +
  2. +
  3. Clone a copy of Intel® Extension for PyTorch* from source:

    +
    git clone https://github.com/intel/intel-extension-for-pytorch.git
    +cd intel-extension-for-pytorch
    +
    +
    +

    If you already have Intel® Extension for PyTorch* from source, update it:

    +
    git pull --rebase
    +git submodule sync --recursive
    +git submodule update --init --recursive --jobs 0
    +
    +
    +
  4. +
  5. Install Intel® Extension for PyTorch* in develop mode:

    +

    Replace:

    +
    python setup.py install
    +
    +
    +

    with:

    +
    python setup.py develop
    +
    +
    +

    This mode will symlink the Python files from the current local source tree into the Python install. After that, if you modify a Python file, you do not need to reinstall again. This is especially useful if you are only changing Python files.

    +

    For example:

    +
      +
    • Install local Intel® Extension for PyTorch* in develop mode

    • +
    • modify your Python file intel_extension_for_pytorch/__init__.py (for example)

    • +
    • test functionality

    • +
    +
  6. +
+

You do not need to repeatedly install after modifying Python files (.py). However, you would need to reinstall if you modify a Python interface (.pyi, .pyi.in) or non-Python files (.cpp, .cc, .cu, .h, etc.).

+

If you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch* first by running pip uninstall intel_extension_for_pytorch until you see WARNING: Skipping intel_extension_for_pytorch as it is not installed; next run python setup.py clean. After that, you can install in develop mode again.

+
+

Tips and Debugging

+
    +
  • CMake must be installed before installing Intel® Extension for PyTorch*. If you’re developing on macOS or Linux, we recommend installing CMake with Homebrew: brew install cmake.

  • +
  • Our setup.py requires Python >= 3.6

  • +
  • If you run into errors when running python setup.py develop, here are some debugging steps:

    +
      +
    1. Run printf '#include <stdio.h>\nint main() { printf("Hello World");}'|clang -x c -; ./a.out to make sure your CMake works and can compile this simple Hello World program without errors.

    2. +
    3. Remove your build directory. The setup.py script compiles binaries into the build folder and caches many details along the way. This saves time the next time you build. If you’re running into issues, you can always rm -rf build from the toplevel pytorch directory and start over.

    4. +
    5. If you have made edits to the Intel® Extension for PyTorch* repo, commit any change you’d like to keep and clean the repo with the following commands (note that clean really removes all untracked files and changes.):

      +
      git submodule deinit -f .
      +git clean -xdf
      +python setup.py clean
      +git submodule update --init --recursive --jobs 0 # very important to sync the submodules
      +python setup.py develop                          # then try running the command again
      +
      +
      +
    6. +
    7. The main step within python setup.py develop is running make from the build directory. If you want to experiment with some environment variables, you can pass them into the command:

      +
      ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop
      +
      +
      +
    8. +
    +
  • +
+
+
+
+

Unit testing

+
+

Python Unit Testing

+

All PyTorch test suites are located in the test folder and start with test_. Run individual test suites using the command python test/cpu/FILENAME.py, where FILENAME represents the file containing the test suite you wish to run.

+

For example, to run all the TorchScript JIT tests (located at test/cpu/test_jit.py), you would run:

+
python test/cpu/test_jit.py
+
+
+

You can narrow down what you’re testing even further by specifying the name of an individual test with TESTCLASSNAME.TESTNAME. Here, TESTNAME is the name of the test you want to run, and TESTCLASSNAME is the name of the class in which it is defined.

+

Let’s say you want to run test_Sequential, which is defined as part of the TestJit class in test/cpu/test_jit.py. Your command would be:

+
python test/test_jit.py TestJit.test_Sequential
+
+
+

The expecttest and hypothesis libraries must be installed to run the tests. mypy is an optional dependency, and pytest may help run tests more selectively. All these packages can be installed with conda or pip.

+
+
+

Better local unit tests with pytest

+

We don’t officially support pytest, but it works well with our unittest tests and offers a number of useful features for local development. Install it via pip install pytest.

+

If you want to run only tests that contain a specific substring, you can use the -k flag:

+
pytest test/cpu/test_nn.py -k Loss -v
+
+
+

The above is an example of testing a change to all Loss functions: this command runs tests such as TestNN.test_BCELoss and TestNN.test_MSELoss and can be useful to save keystrokes.

+
+
+

Local linting

+

You can run the same linting steps that are used in CI locally via make:

+
# Lint all files
+make lint -j 6  # run lint (using 6 parallel jobs)
+
+# Lint only the files you have changed
+make quicklint -j 6
+
+
+

These jobs may require extra dependencies that aren’t dependencies of Intel® Extension for PyTorch* itself, so you can install them via this command, which you should only have to run once:

+
make setup_lint
+
+
+

To run a specific linting step, use one of these targets or see the Makefile for a complete list of options.

+
# Check for tabs, trailing newlines, etc.
+make quick_checks
+
+make flake8
+
+make mypy
+
+make cmakelint
+
+make clang-tidy
+
+
+

To run a lint only on changes, add the CHANGED_ONLY option:

+
make <name of lint> CHANGED_ONLY=--changed-only
+
+
+
+
+

C++ Unit Testing

+

Intel® Extension for PyTorch* offers tests located in the test/cpp folder. These tests are written in C++ and use the Google Test testing framework. After compiling Intel® Extension for PyTorch* from source, the test runner binaries will be written to the build/bin folder. The command to run one of these tests is ./build/bin/FILENAME --gtest_filter=TESTSUITE.TESTNAME, where TESTNAME is the name of the test you’d like to run and TESTSUITE is the suite that test is defined in.

+

For example, if you wanted to run the test MayContainAlias, which is part of the test suite ContainerAliasingTest in the file test/cpp/jit/test_alias_analysis.cpp, the command would be:

+
./build/bin/test_jit --gtest_filter=ContainerAliasingTest.MayContainAlias
+
+
+
+
+
+

Writing documentation

+

So you want to write some documentation for your code contribution and don’t know where to start?

+

Intel® Extension for PyTorch* uses Google style for formatting docstrings. The length of lines inside a docstring block must be limited to 80 characters so they fit into Jupyter documentation popups.

+
+
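For reference, a short hypothetical docstring in Google style, kept within the 80-character limit:

def fuse_linear_bn(linear, bn):
    """Fuse a Linear layer with a following BatchNorm1d (hypothetical helper).

    Args:
        linear (torch.nn.Linear): The linear layer to fuse.
        bn (torch.nn.BatchNorm1d): The batch norm folded into the linear layer.

    Returns:
        torch.nn.Linear: A new linear layer with folded weight and bias.
    """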

Building documentation

+

To build the documentation:

+
    +
  1. Build and install Intel® Extension for PyTorch* (as discussed above)

  2. +
  3. Install the prerequisites:

    +
    cd docs
    +pip install -r requirements.txt
    +
    +
    +
  4. +
  5. Generate the documentation HTML files. The generated files will be in docs/_build/html.

    +
    make clean
    +make html
    +
    +
    +
  6. +
+
+

Tips

+

The .rst source files live in docs/tutorials. Some of the .rst files pull in docstrings from Intel® Extension for PyTorch* Python code (for example, via the autofunction or autoclass directives). To shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base index.rst file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete.

+
+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/examples.html b/cpu/2.4.0+cpu/tutorials/examples.html new file mode 100644 index 000000000..f6ef11712 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/examples.html @@ -0,0 +1,1469 @@ + + + + + + + Examples — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Examples

+

These examples will guide you through using the Intel® Extension for PyTorch* on Intel CPUs.

+

You can also refer to the Features section to get the examples and usage instructions related to particular features.

+

The source code for these examples, as well as the feature examples, can be found in the GitHub source tree under the examples directory.

+
    +
  • Python examples demonstrate usage of Python APIs:

    + +
  • +
  • C++ examples demonstrate usage of C++ APIs

  • +
  • Intel® AI Reference Models provide out-of-the-box use cases, demonstrating the performance benefits achievable with Intel Extension for PyTorch*

  • +
+

Prerequisites: +Before running these examples, please note the following:

+
    +
  • Examples using the BFloat16 data type require machines with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) BF16 and Intel® Advanced Matrix Extensions (Intel® AMX) BF16 instruction sets.

  • +
+
+

Python

+
+

Training

+
+

Distributed Training

+

Distributed training with PyTorch DDP is accelerated by oneAPI Collective Communications Library Bindings for Pytorch* (oneCCL Bindings for Pytorch*). The extension supports FP32 and BF16 data types. More detailed information and examples are available at the Github repo.

+

Note: You need to install torchvision Python package to run the following example.

+
import os
+import torch
+import torch.distributed as dist
+import torchvision
+import oneccl_bindings_for_pytorch as torch_ccl  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+LR = 0.001
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+os.environ["MASTER_ADDR"] = "127.0.0.1"
+os.environ["MASTER_PORT"] = "29500"
+os.environ["RANK"] = os.environ.get("PMI_RANK", 0)
+os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE", 1)
+dist.init_process_group(backend="ccl", init_method="env://")
+rank = int(os.environ["RANK"])  # integer rank so the rank == 0 check below works
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+dist_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
+train_loader = torch.utils.data.DataLoader(
+    dataset=train_dataset, batch_size=128, sampler=dist_sampler
+)
+
+model = torchvision.models.resnet50()
+criterion = torch.nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
+model.train()
+model, optimizer = ipex.optimize(model, optimizer=optimizer)
+
+model = torch.nn.parallel.DistributedDataParallel(model)
+
+for batch_idx, (data, target) in enumerate(train_loader):
+    optimizer.zero_grad()
+    output = model(data)
+    loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+    print("batch_id: {}".format(batch_idx))
+
+if rank == 0:
+    torch.save(
+        {
+            "model_state_dict": model.state_dict(),
+            "optimizer_state_dict": optimizer.state_dict(),
+        },
+        "checkpoint.pth",
+    )
+
+dist.destroy_process_group()
+print("Execution finished")
+
+
+
+
+
+

Inference

+

The optimize function of Intel® Extension for PyTorch* applies optimizations to the model, bringing additional performance boosts. For both computer vision workloads and NLP workloads, we recommend applying the optimize function against the model object.

+
+

Float32

+
+
Eager Mode
+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchScript Mode
+

We recommend using Intel® Extension for PyTorch* with TorchScript for further optimizations.

+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    d = torch.rand(128, 3, 224, 224)
+    model = torch.jit.trace(model, d)
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    d = torch.randint(vocab_size, size=[batch_size, seq_length])
+    model = torch.jit.trace(model, (d,), check_trace=False, strict=False)
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchDynamo Mode (Beta, NEW feature from 2.0.0)
+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+# Beta Feature
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, weights_prepack=False)
+model = torch.compile(model, backend="ipex")
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+# Beta Feature
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, weights_prepack=False)
+model = torch.compile(model, backend="ipex")
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+

Note: In TorchDynamo mode, since the native PyTorch operators like aten::convolution and aten::linear are well supported and optimized in ipex backend, we need to disable weights prepacking by setting weights_prepack=False in ipex.optimize().

+
+
+
+
+

BFloat16

+

The optimize function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the dtype parameter to torch.bfloat16. +We recommend using Auto Mixed Precision (AMP) with BFloat16 data type.

+
+
Eager Mode
+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchScript Mode
+

We recommend using Intel® Extension for PyTorch* with TorchScript for further optimizations.

+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    model = torch.jit.trace(model, torch.rand(128, 3, 224, 224))
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    d = torch.randint(vocab_size, size=[batch_size, seq_length])
+    model = torch.jit.trace(model, (d,), check_trace=False, strict=False)
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchDynamo Mode (Beta, NEW feature from 2.0.0)
+
+
Resnet50
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+
+# Beta Feature
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)
+model = torch.compile(model, backend="ipex")
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+# Beta Feature
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False)
+model = torch.compile(model, backend="ipex")
+######################################################  # noqa F401
+
+# Note: bf16 inference requires amp.autocast() context  # noqa F401
+with torch.no_grad(), torch.cpu.amp.autocast():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
+

Fast Bert (Prototype)

+

Note: You need to install transformers Python package to run the following example.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+torch.manual_seed(43)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.fast_bert(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+

INT8

+

Starting from Intel® Extension for PyTorch* 1.12.0, quantization feature supports both static and dynamic modes.

+
+
Static Quantization
+
+
Calibration
+

Please follow the steps below to perform calibration for static quantization:

+
    +
  1. Import intel_extension_for_pytorch as ipex.

  2. +
  3. Import prepare and convert from intel_extension_for_pytorch.quantization.

  4. +
  5. Instantiate a config object from torch.ao.quantization.QConfig to save configuration data during calibration.

  6. +
  7. Prepare model for calibration.

  8. +
  9. Perform calibration against dataset.

  10. +
  11. Invoke ipex.quantization.convert function to apply the calibration configure object to the fp32 model object to get an INT8 model.

  12. +
  13. Save the INT8 model into a pt file.

  14. +
+

Note: You need to install torchvision Python package to run the following example.

+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+######################################################  # noqa F401
+
+##### Example Model #####  # noqa F401
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(128, 3, 224, 224)
+#########################  # noqa F401
+
+qconfig_mapping = ipex.quantization.default_static_qconfig_mapping
+# Alternatively, define your own qconfig_mapping:
+# from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig, QConfigMapping
+# qconfig = QConfig(
+#        activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
+#        weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+# qconfig_mapping = QConfigMapping().set_global(qconfig)
+prepared_model = prepare(model, qconfig_mapping, example_inputs=data, inplace=False)
+
+##### Example Dataloader #####  # noqa F401
+import torchvision
+
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+calibration_data_loader = torch.utils.data.DataLoader(
+    dataset=train_dataset, batch_size=128
+)
+
+with torch.no_grad():
+    for batch_idx, (d, target) in enumerate(calibration_data_loader):
+        print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}")
+        prepared_model(d)
+##############################  # noqa F401
+
+converted_model = convert(prepared_model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(converted_model, data)
+    traced_model = torch.jit.freeze(traced_model)
+
+traced_model.save("static_quantized_model.pt")
+
+print("Saved model to: static_quantized_model.pt")
+
+
+
+
+
Deployment
+

For deployment, the INT8 model is loaded from the local file and can be used directly for sample inference.

+

Follow the steps below:

+
    +
  1. Import intel_extension_for_pytorch as ipex.

  2. +
  3. Load the INT8 model from the saved file.

  4. +
  5. Run inference.

  6. +
+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex  # noqa F401
+
+######################################################  # noqa F401
+
+model = torch.jit.load("static_quantized_model.pt")
+model.eval()
+model = torch.jit.freeze(model)
+data = torch.rand(128, 3, 224, 224)
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
Dynamic Quantization
+

Please follow the steps below to perform dynamic quantization:

+
    +
  1. Import intel_extension_for_pytorch as ipex.

  2. +
  3. Import prepare and convert from intel_extension_for_pytorch.quantization.

  4. +
  5. Instantiate a config object from torch.ao.quantization.QConfig to save configuration data during calibration.

  6. +
  7. Prepare model for quantization.

  8. +
  9. Convert the model.

  10. +
  11. Run inference to perform dynamic quantization.

  12. +
  13. Save the INT8 model into a pt file.

  14. +
+

Note: You need to install transformers Python package to run the following example.

+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+######################################################  # noqa F401
+
+##### Example Model #####  # noqa F401
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 128
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+#########################  # noqa F401
+
+qconfig_mapping = ipex.quantization.default_dynamic_qconfig_mapping
+# Alternatively, define your own qconfig:
+# from torch.ao.quantization import PerChannelMinMaxObserver, PlaceholderObserver, QConfig, QConfigMapping
+# qconfig = QConfig(
+#        activation = PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
+#        weight = PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+# qconfig_mapping = QConfigMapping().set_global(qconfig)
+prepared_model = prepare(model, qconfig_mapping, example_inputs=data)
+
+converted_model = convert(prepared_model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(
+        converted_model, (data,), check_trace=False, strict=False
+    )
+    traced_model = torch.jit.freeze(traced_model)
+
+traced_model.save("dynamic_quantized_model.pt")
+
+print("Saved model to: dynamic_quantized_model.pt")
+
+
+
+
+
+
+

Large Language Model (LLM)

+

Intel® Extension for PyTorch* provides dedicated optimization for running Large Language Models (LLM) faster. +A set of data types are supported for various scenarios, including FP32, BF16, Smooth Quantization INT8, Weight Only Quantization INT8/INT4 (prototype).

+

Note: You need to install the transformers==4.43.2 Python package to run the following example. In addition, you may need to log in to your HuggingFace account to access the pretrained model files. Please refer to HuggingFace login.

+
+

FP32/BF16

+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+######################################################  # noqa F401
+import argparse
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
+
+# args
+parser = argparse.ArgumentParser("Generation script (fp32/bf16 path)", add_help=False)
+parser.add_argument(
+    "--dtype",
+    type=str,
+    choices=["float32", "bfloat16"],
+    default="float32",
+    help="choose the weight dtype and whether to enable auto mixed precision or not",
+)
+parser.add_argument(
+    "--max-new-tokens", default=32, type=int, help="output max new tokens"
+)
+parser.add_argument(
+    "--prompt", default="What are we having for dinner?", type=str, help="input prompt"
+)
+parser.add_argument("--greedy", action="store_true")
+parser.add_argument("--batch-size", default=1, type=int, help="batch size")
+args = parser.parse_args()
+print(args)
+
+# dtype
+amp_enabled = True if args.dtype != "float32" else False
+amp_dtype = getattr(torch, args.dtype)
+
+# load model
+model_id = "facebook/opt-125m"
+config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=amp_dtype,
+    config=config,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = model.eval()
+model = model.to(memory_format=torch.channels_last)
+
+# Intel(R) Extension for PyTorch*
+#################### code changes ####################  # noqa F401
+model = ipex.llm.optimize(
+    model,
+    dtype=amp_dtype,
+    inplace=True,
+    deployment_mode=True,
+)
+######################################################  # noqa F401
+
+# generate args
+num_beams = 1 if args.greedy else 4
+generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)
+
+# input prompt
+prompt = args.prompt
+input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
+print("---- Prompt size:", input_size)
+prompt = [prompt] * args.batch_size
+
+# inference
+with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast(
+    enabled=amp_enabled
+):
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    gen_ids = model.generate(
+        input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs
+    )
+    gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
+    input_tokens_lengths = [x.shape[0] for x in input_ids]
+    output_tokens_lengths = [x.shape[0] for x in gen_ids]
+    total_new_tokens = [
+        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
+    ]
+    print(gen_text, total_new_tokens, flush=True)
+
+
+
+
+

Smooth Quantization INT8

+

The typical steps shown in the example are:

+
  1. Calibration process: Run the example script with --calibration, along with other related arguments. When the calibration process completes, the quantization summary files are generated.

  2. Model inference process: Run the example script without --calibration. In this process the quantized model is generated from the original model together with the quantization config and summary files, and results are generated for the input prompt.
+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+######################################################  # noqa F401
+import argparse
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
+
+# args
+parser = argparse.ArgumentParser(
+    "Generation script (static quantization path)", add_help=False
+)
+parser.add_argument(
+    "--dtype",
+    type=str,
+    choices=["float32", "bfloat16"],
+    default="float32",
+    help="choose the weight dtype and whether to enable auto mixed precision or not",
+)
+parser.add_argument(
+    "--max-new-tokens", default=32, type=int, help="output max new tokens"
+)
+parser.add_argument(
+    "--prompt", default="What are we having for dinner?", type=str, help="input prompt"
+)
+parser.add_argument("--greedy", action="store_true")
+parser.add_argument("--batch-size", default=1, type=int, help="batch size")
+parser.add_argument("--calibration", action="store_true")
+parser.add_argument(
+    "--calibration-samples",
+    default=512,
+    type=int,
+    help="total number of calibration samples",
+)
+parser.add_argument(
+    "--int8-qconfig",
+    nargs="?",
+    default="./qconfig.json",
+    help="static quantization factors summary files generated by calibration",
+)
+parser.add_argument("--dataset", nargs="?", default="NeelNanda/pile-10k")
+parser.add_argument(
+    "--alpha", default=0.5, type=float, help="alpha value for smoothquant"
+)
+args = parser.parse_args()
+print(args)
+
+
+# dtype
+amp_enabled = True if args.dtype != "float32" and not args.calibration else False
+amp_dtype = getattr(torch, args.dtype) if not args.calibration else torch.float32
+
+# load model
+model_id = "meta-llama/Llama-2-7b-hf"
+config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=amp_dtype,
+    config=config,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = model.eval()
+model = model.to(memory_format=torch.channels_last)
+
+num_beams = 1 if args.greedy else 4
+beam_idx_tmp = torch.zeros(
+    (2048, int(args.batch_size * num_beams)), dtype=torch.long
+).contiguous()
+global_past_key_value = [
+    (
+        torch.zeros(1, 0, 0, 1, dtype=torch.long).contiguous(),
+        torch.zeros(
+            [
+                1,
+                model.config.num_attention_heads,
+                1,
+                int(model.config.hidden_size / model.config.num_attention_heads),
+            ]
+        ).contiguous(),
+        torch.zeros(
+            [
+                1,
+                model.config.num_attention_heads,
+                1,
+                int(model.config.hidden_size / model.config.num_attention_heads),
+            ]
+        ).contiguous(),
+        beam_idx_tmp,
+    )
+    for i in range(model.config.num_hidden_layers)
+]
+
+
+# Intel(R) Extension for PyTorch*
+#################### code changes ####################  # noqa F401
+class Calibration:
+    def __init__(self, dataset, tokenizer, batch_size=1, pad_val=1, pad_max=512):
+        self.dataset = dataset
+        self.tokenizer = tokenizer
+        self.batch_size = batch_size
+        self.pad_val = pad_val
+        self.pad_max = pad_max
+
+        # tokenize the dataset
+        self.dataset = self.dataset.map(self.tokenize_function, batched=True)
+        self.dataset.set_format(type="torch", columns=["input_ids"])
+
+    @torch.no_grad()
+    def tokenize_function(self, examples):
+        if "prompt" in examples:
+            example = self.tokenizer(examples["prompt"])
+        elif "text" in examples:
+            example = self.tokenizer(examples["text"])
+        elif "code" in examples:
+            example = self.tokenizer(examples["code"])
+        return example
+
+    @torch.no_grad()
+    def collate_batch(self, batch):
+        position_ids_padded = []
+        input_ids_padded = []
+        last_ind = []
+        attention_mask_padded = []
+        for text in batch:
+            input_ids = text["input_ids"]
+            input_ids = (
+                input_ids[: int(self.pad_max)]
+                if len(input_ids) > int(self.pad_max)
+                else input_ids
+            )
+            last_ind.append(input_ids.shape[0] - 1)
+            attention_mask = torch.ones(len(input_ids))
+            position_ids = torch.arange(len(input_ids))
+            input_ids_padded.append(input_ids)
+            attention_mask_padded.append(attention_mask)
+            position_ids_padded.append(position_ids)
+        return (
+            (
+                torch.vstack(input_ids_padded),
+                torch.vstack(attention_mask_padded),
+                torch.vstack(position_ids_padded),
+                tuple(global_past_key_value),
+            ),
+            torch.tensor(last_ind),
+        )
+
+
+from datasets import load_dataset
+from torch.utils.data import DataLoader
+
+calib_dataset = load_dataset(args.dataset, split="train")
+calib_evaluator = Calibration(calib_dataset, tokenizer, args.batch_size)
+calib_dataloader = DataLoader(
+    calib_evaluator.dataset,
+    batch_size=1,
+    shuffle=False,
+    collate_fn=calib_evaluator.collate_batch,
+)
+
+qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping(alpha=args.alpha)
+
+if args.calibration:
+    example_inputs = None
+    for i, (
+        (input_ids, attention_mask, position_ids, past_key_values),
+        last_ind,
+    ) in enumerate(calib_dataloader):
+        example_inputs = (input_ids, attention_mask, position_ids, past_key_values)
+        break
+    from intel_extension_for_pytorch.quantization import prepare
+
+    model = ipex.llm.optimize(
+        model.eval(),
+        dtype=amp_dtype,
+        quantization_config=qconfig,
+        inplace=True,
+        deployment_mode=False,
+    )
+    prepared_model = prepare(model.eval(), qconfig, example_inputs=example_inputs)
+    with torch.no_grad():
+        for i, (
+            (input_ids, attention_mask, position_ids, past_key_values),
+            last_ind,
+        ) in enumerate(calib_dataloader):
+            if i == args.calibration_samples:
+                break
+            prepared_model(
+                input_ids,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+            )
+
+    prepared_model.save_qconf_summary(qconf_summary=args.int8_qconfig)
+    print(
+        "calibration Done! Will exit and please launch model quantization and benchmark"
+    )
+    exit(0)
+else:
+    model = ipex.llm.optimize(
+        model.eval(),
+        dtype=amp_dtype,
+        quantization_config=qconfig,
+        qconfig_summary_file=args.int8_qconfig,
+        inplace=True,
+        deployment_mode=True,
+    )
+    print("model quantization - Done!")
+
+######################################################  # noqa F401
+
+# generate args
+num_beams = 1 if args.greedy else 4
+generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)
+
+# input prompt
+prompt = args.prompt
+input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
+print("---- Prompt size:", input_size)
+prompt = [prompt] * args.batch_size
+
+# inference
+with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast(
+    enabled=amp_enabled
+):
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    gen_ids = model.generate(
+        input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs
+    )
+    gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
+    input_tokens_lengths = [x.shape[0] for x in input_ids]
+    output_tokens_lengths = [x.shape[0] for x in gen_ids]
+    total_new_tokens = [
+        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
+    ]
+    print(gen_text, total_new_tokens, flush=True)
+
+
+
+
+

Weight Only Quantization INT8/INT4

+
import torch
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+######################################################  # noqa F401
+import argparse
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
+
+# args
+parser = argparse.ArgumentParser(
+    "Generation script (weight only quantization path)", add_help=False
+)
+parser.add_argument(
+    "--dtype",
+    type=str,
+    choices=["float32", "bfloat16"],
+    default="float32",
+    help="choose the weight dtype and whether to enable auto mixed precision or not",
+)
+parser.add_argument(
+    "--max-new-tokens", default=32, type=int, help="output max new tokens"
+)
+parser.add_argument(
+    "--prompt", default="What are we having for dinner?", type=str, help="input prompt"
+)
+parser.add_argument("--greedy", action="store_true")
+parser.add_argument("--batch-size", default=1, type=int, help="batch size")
+# Intel(R) Extension for PyTorch*
+#################### code changes ####################  # noqa F401
+parser.add_argument(
+    "--lowp-mode",
+    choices=["AUTO", "BF16", "FP32", "INT8", "FP16"],
+    default="AUTO",
+    type=str,
+    help="low precision mode for weight only quantization. "
+    "It indicates data type for computation for speedup at the cost "
+    "of accuracy. Unrelated to activation or weight data type."
+    "It is not supported yet to use lowp_mode=INT8 for INT8 weight, "
+    "falling back to lowp_mode=BF16 implicitly in this case."
+    "If set to AUTO, lowp_mode is determined by weight data type: "
+    "lowp_mode=BF16 is used for INT8 weight "
+    "and lowp_mode=INT8 used for INT4 weight",
+)
+parser.add_argument(
+    "--weight-dtype",
+    choices=["INT8", "INT4"],
+    default="INT8",
+    type=str,
+    help="weight data type for weight only quantization. Unrelated to activation"
+    " data type or lowp-mode. If `--low-precision-checkpoint` is given, weight"
+    " data type is always INT4 and this argument is not needed.",
+)
+parser.add_argument(
+    "--low-precision-checkpoint",
+    default="",
+    type=str,
+    help="Low precision checkpoint file generated by calibration, such as GPTQ. It contains"
+    " modified weights, scales, zero points, etc. For better accuracy of weight only"
+    " quantization with INT4 weight.",
+)
+######################################################  # noqa F401
+args = parser.parse_args()
+print(args)
+
+# dtype
+amp_enabled = True if args.dtype != "float32" else False
+amp_dtype = getattr(torch, args.dtype)
+
+# load model
+model_id = "facebook/opt-125m"
+config = AutoConfig.from_pretrained(model_id, torchscript=True, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=amp_dtype,
+    config=config,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+model = model.eval()
+model = model.to(memory_format=torch.channels_last)
+
+# Intel(R) Extension for PyTorch*
+#################### code changes ####################  # noqa F401
+from intel_extension_for_pytorch.quantization import WoqWeightDtype
+
+weight_dtype = (
+    WoqWeightDtype.INT4 if args.weight_dtype == "INT4" else WoqWeightDtype.INT8
+)
+
+if args.lowp_mode == "INT8":
+    lowp_mode = ipex.quantization.WoqLowpMode.INT8
+elif args.lowp_mode == "FP32":
+    lowp_mode = ipex.quantization.WoqLowpMode.NONE
+elif args.lowp_mode == "FP16":
+    lowp_mode = ipex.quantization.WoqLowpMode.FP16
+elif args.lowp_mode == "BF16":
+    lowp_mode = ipex.quantization.WoqLowpMode.BF16
+else:  # AUTO
+    if args.low_precision_checkpoint != "" or weight_dtype == WoqWeightDtype.INT4:
+        lowp_mode = ipex.quantization.WoqLowpMode.INT8
+    else:
+        lowp_mode = ipex.quantization.WoqLowpMode.BF16
+
+qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
+    weight_dtype=weight_dtype, lowp_mode=lowp_mode
+)
+if args.low_precision_checkpoint != "":
+    low_precision_checkpoint = torch.load(args.low_precision_checkpoint)
+else:
+    low_precision_checkpoint = None
+model = ipex.llm.optimize(
+    model.eval(),
+    dtype=amp_dtype,
+    quantization_config=qconfig,
+    low_precision_checkpoint=low_precision_checkpoint,
+    deployment_mode=True,
+    inplace=True,
+)
+
+######################################################  # noqa F401
+
+# generate args
+num_beams = 1 if args.greedy else 4
+generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)
+
+# input prompt
+prompt = args.prompt
+input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
+print("---- Prompt size:", input_size)
+prompt = [prompt] * args.batch_size
+
+# inference
+with torch.no_grad(), torch.inference_mode(), torch.cpu.amp.autocast(
+    enabled=amp_enabled
+):
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    gen_ids = model.generate(
+        input_ids, max_new_tokens=args.max_new_tokens, **generate_kwargs
+    )
+    gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
+    input_tokens_lengths = [x.shape[0] for x in input_ids]
+    output_tokens_lengths = [x.shape[0] for x in gen_ids]
+    total_new_tokens = [
+        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
+    ]
+    print(gen_text, total_new_tokens, flush=True)
+
+
+

Note: Please check the LLM Best Known Practice Page for detailed environment setup and LLM workload running instructions.

+
+
+
+
+

C++

+

To work with libtorch, the C++ library of PyTorch, Intel® Extension for PyTorch* provides its C++ dynamic library as well. The C++ library is intended to handle inference workloads only, such as service deployment. For regular development, use the Python interface. Compared to using plain libtorch, no specific code changes are required. Compilation follows the recommended CMake methodology. Detailed instructions can be found in the PyTorch tutorial.

+

During compilation, Intel optimizations will be activated automatically once C++ dynamic library of Intel® Extension for PyTorch* is linked.

+

The example code below works for all data types.

+

example-app.cpp

+
#include <torch/script.h>
+#include <iostream>
+#include <memory>
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  } catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+
+  std::vector<torch::jit::IValue> inputs;
+  torch::Tensor input = torch::rand({1, 3, 224, 224});
+  inputs.push_back(input);
+
+  at::Tensor output = module.forward(inputs).toTensor();
+  std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+
+
+

CMakeLists.txt

+
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-app)
+
+find_package(IPEX REQUIRED)
+
+add_executable(example-app example-app.cpp)
+target_link_libraries(example-app "${TORCH_IPEX_LIBRARIES}")
+
+set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
+
+
+

Command for compilation

+
$ cd examples/cpu/inference/cpp
+$ mkdir build
+$ cd build
+$ cmake -DCMAKE_PREFIX_PATH=<LIBPYTORCH_PATH> ..
+$ make
+
+
+

If Found IPEX is shown with a dynamic library path, the extension has been linked into the binary. This can be verified with the Linux command ldd.

+
$ cmake -DCMAKE_PREFIX_PATH=/workspace/libtorch ..
+-- The C compiler identification is GNU XX.X.X
+-- The CXX compiler identification is GNU XX.X.X
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /usr/bin/cc - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /usr/bin/c++ - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+CMake Warning at /workspace/libtorch/share/cmake/Torch/TorchConfig.cmake:22 (message):
+  static library kineto_LIBRARY-NOTFOUND not found.
+Call Stack (most recent call first):
+  /workspace/libtorch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
+  /workspace/libtorch/share/cmake/IPEX/IPEXConfig.cmake:84 (FIND_PACKAGE)
+  CMakeLists.txt:4 (find_package)
+
+
+-- Found Torch: /workspace/libtorch/lib/libtorch.so
+-- Found IPEX: /workspace/libtorch/lib/libintel-ext-pt-cpu.so
+-- Configuring done
+-- Generating done
+-- Build files have been written to: examples/cpu/inference/cpp/build
+
+$ ldd example-app
+        ...
+        libtorch.so => /workspace/libtorch/lib/libtorch.so (0x00007f3cf98e0000)
+        libc10.so => /workspace/libtorch/lib/libc10.so (0x00007f3cf985a000)
+        libintel-ext-pt-cpu.so => /workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007f3cf70fc000)
+        libtorch_cpu.so => /workspace/libtorch/lib/libtorch_cpu.so (0x00007f3ce16ac000)
+        ...
+        libdnnl_graph.so.0 => /workspace/libtorch/lib/libdnnl_graph.so.0 (0x00007f3cde954000)
+        ...
+
+
+
+
diff --git a/cpu/2.4.0+cpu/tutorials/features.html b/cpu/2.4.0+cpu/tutorials/features.html
Features

+

This section provides a detailed overview of supported features.

+
+

Easy-to-use Python API

+

With only two or three clauses added to your original code, Intel® Extension for PyTorch* provides simple frontend Python APIs and utilities to get performance optimizations such as graph optimization and operator optimization.

+

Check the API Documentation for API functions description and Examples for usage guidance.
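As a quick illustration, below is a minimal sketch; the torchvision ResNet-50 model and the random input are placeholders for your own workload.

import torch
import torchvision.models as models

#################### code changes ####################  # noqa F401
import intel_extension_for_pytorch as ipex

######################################################  # noqa F401

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

# the single added clause: apply frontend optimizations to the FP32 model
model = ipex.optimize(model)

with torch.no_grad():
    model(data)

print("Execution finished")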

+
+

Note

+

The package name used when you import Intel® Extension for PyTorch* changed from intel_pytorch_extension (for versions 1.2.0 through 1.9.0) to intel_extension_for_pytorch (for versions 1.10.0 and later). Use the correct package name depending on the version you are using.
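For example:

# 1.10.0 and later
import intel_extension_for_pytorch as ipex

# 1.2.0 through 1.9.0 (legacy releases only)
# import intel_pytorch_extension as ipex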

+
+
+
+

Large Language Models (LLM, NEW feature from 2.1.0)

+

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain LLM models are introduced in Intel® Extension for PyTorch*.

+

For more detailed information, check LLM Optimizations Overview.

+
+
+

torch.compile (Beta, NEW feature from 2.0.0)

+

PyTorch* 2.0 introduces a new feature, torch.compile, to speed up PyTorch* code. It makes PyTorch code run faster by JIT-compiling PyTorch code into optimized kernels. Intel® Extension for PyTorch* enables a backend, ipex, in torch.compile to optimize generation of the graph model.

+

To use the feature, import Intel® Extension for PyTorch* and set the backend parameter of torch.compile to ipex.

+

With torch.compile backend set to ipex, the following will happen:

+
  1. Register Intel® Extension for PyTorch* operators to Inductor.

  2. Custom fusions at FX graph level, e.g., the migration of existing TorchScript-based fusion kernels in IPEX to inductor, pattern-based fusions to achieve peak performance.
+

While optimizations with torch.compile apply to the backend, invoking the ipex.optimize function is also highly recommended to apply optimizations in the frontend.

+
import torch
+import intel_extension_for_pytorch as ipex
+...
+model = ipex.optimize(model, weights_prepack=False)
+model = torch.compile(model, backend='ipex')
+...
+
+
+
+
+

ISA Dynamic Dispatching

+

Intel® Extension for PyTorch* features dynamic dispatching functionality to automatically adapt execution binaries to the most advanced instruction set available on your machine.

+

For details, refer to ISA Dynamic Dispatching.

+
+
+
+
+

Auto Channels Last

+

Compared to the default NCHW memory format, using the channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch*, the NHWC memory format has been enabled for most key CPU operators. More detailed information is available at Channels Last.

+

Intel® Extension for PyTorch* automatically converts a model to channels last memory format when users optimize the model with ipex.optimize(model). With this feature, there is no need to manually apply model=model.to(memory_format=torch.channels_last) anymore. More detailed information is available at Auto Channels Last.

+
+
+
+
+

Auto Mixed Precision (AMP)

+

Low precision data type BFloat16 has been natively supported on 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with the AVX512 instruction set. It will also be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, providing further boosted performance. Support of Auto Mixed Precision (AMP) with BFloat16 for CPU and BFloat16 optimization of operators has been enabled in Intel® Extension for PyTorch*, and partially upstreamed to the PyTorch master branch. These optimizations will land in PyTorch master through PRs that are being submitted and reviewed.

+

Prefer to use torch.cpu.amp.autocast() instead of torch.autocast(device_type="cpu").

+

For details, refer to Auto Mixed Precision (AMP).

+

BFloat16 computation can be conducted on platforms with the AVX512 instruction set. On platforms with the AVX512 BFloat16 instructions, there will be an additional performance boost.

+
+
+
+
+

Graph Optimization

+

To further optimize TorchScript performance, Intel® Extension for PyTorch* supports transparent fusion of frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU. For more detailed information, check Graph Optimization.

+

Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization methodologies such as operator fusion. Intel® Extension for PyTorch* provides further optimizations in graph mode. We recommend taking advantage of Intel® Extension for PyTorch* with TorchScript. You may wish to run with the torch.jit.trace() function first, since it generally works better with Intel® Extension for PyTorch* than using the torch.jit.script() function. More detailed information can be found at the pytorch.org website.
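A minimal sketch of this recommendation follows; the ResNet-50 model is only a placeholder, and the fused graph is produced while freezing and running the traced module.

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

model = ipex.optimize(model)

with torch.no_grad():
    # torch.jit.trace() generally works better with the extension than torch.jit.script()
    traced_model = torch.jit.trace(model, data)
    traced_model = torch.jit.freeze(traced_model)
    # fusion passes (e.g. Conv2D+ReLU) are applied during the first runs of the frozen graph
    traced_model(data)
    traced_model(data)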

+
+
+
+
+

Operator Optimization

+

Intel® Extension for PyTorch* also optimizes operators and implements several customized operators for performance boosts. A few ATen operators are replaced by their optimized counterparts in Intel® Extension for PyTorch* via the ATen registration mechanism. Some customized operators are implemented for several popular topologies. For instance, ROIAlign and NMS are defined in Mask R-CNN. To improve performance of these topologies, Intel® Extension for PyTorch* also optimized these customized operators.

+
+
+class ipex.nn.FrozenBatchNorm2d(num_features: int, eps: float = 1e-05)
+

BatchNorm2d where the batch statistics and the affine parameters are fixed

+
+
Parameters:
+

num_features (int) – C from an expected input of size (N, C, H, W). Input shape: (N, C, H, W). Output shape: (N, C, H, W) (same shape as input).
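A hypothetical usage sketch (the shapes below are illustrative only):

import torch
import intel_extension_for_pytorch as ipex

# batch statistics and affine parameters stay fixed during training and inference
frozen_bn = ipex.nn.FrozenBatchNorm2d(64)
x = torch.randn(8, 64, 56, 56)  # (N, C, H, W) with C = num_features = 64
y = frozen_bn(x)                # same shape as the input: (8, 64, 56, 56)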

+
+
+
+ +
+
+ipex.nn.functional.interaction(*args)
+

Get the interaction feature between different kinds of features (like gender or hobbies), as used in the DLRM model.

+

For now, we only optimize the “dot” interaction as in the DLRM GitHub repo. Through this, we use the dot product to represent the interaction feature between two features.

+

For example, if feature 1 is “Man” which is represented by [0.1, 0.2, 0.3], and feature 2 is “Like play football” which is represented by [-0.1, 0.3, 0.2].

+

The dot interaction feature is ([0.1, 0.2, 0.3] * [-0.1, 0.3, 0.2]^T) = -0.01 + 0.06 + 0.06 = 0.11

+
+
Parameters:
+

*args – Multiple tensors which represent different features. Input shape: N * (B, D), where N is the number of different kinds of features, B is the batch size, D is the feature size. Output shape: (B, D + N * (N - 1) / 2).
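A small sketch with made-up feature tensors (N = 3 kinds of features, B = 4, D = 16), just to show the shapes:

import torch
import intel_extension_for_pytorch as ipex

B, D = 4, 16
dense_feature = torch.randn(B, D)  # e.g. dense features
emb_gender = torch.randn(B, D)     # e.g. embedding output for "gender"
emb_hobby = torch.randn(B, D)      # e.g. embedding output for "hobbies"

out = ipex.nn.functional.interaction(dense_feature, emb_gender, emb_hobby)
print(out.shape)  # torch.Size([4, 19]) = (B, D + N * (N - 1) / 2) with N = 3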

+
+
+
+ +
+
+class ipex.nn.modules.MergedEmbeddingBag(embedding_specs: List[EmbeddingSpec])
+

Merge multiple Pytorch EmbeddingBag objects into a single torch.nn.Module object.

+

At the current stage:

+

MergedEmbeddingBag assumes to be constructed from nn.EmbeddingBag with sparse=False, returns dense gradients.

+

MergedEmbeddingBagWithSGD does not return gradients, backward step and weights update step are fused.

+

Native usage of multiple EmbeddingBag objects is:

+
>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
+>>> inputs = [in1, in2, in3, ..., in_m]
+>>> outputs = []
+>>> for i in range(len(EmbLists)):
+>>>     outputs.append(EmbLists[i](inputs[i]))
+
+
+

The optimized path is:

+
>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
+>>> merged_emb = MergedEmbeddingBagWithSGD.from_embeddingbag_list(EmbLists)
+>>> outputs = merged_emb(inputs)
+
+
+

Computation benefits from the optimized path:

+
+

1). PyTorch OP dispatching overhead is minimized. If EmbeddingBag operations are not heavy, this dispatching overhead has a big impact.

+

2). Parallelization over separate embedding tables is merged into parallelization over a single merged embedding table. This can benefit scenarios with low parallelization efficiency, where the data size read out from the embedding tables is not large enough.

+
+

Now MergedEmbeddingBagWithSGD is the only option running with an optimizer. We plan to add more optimizer support in the future. Visit MergedEmbeddingBagWithSGD for an introduction to MergedEmbeddingBagWith[Optimizer].

+
+ +
+
+class ipex.nn.modules.MergedEmbeddingBagWithSGD(embedding_specs: List[EmbeddingSpec], lr: float = 0.01, weight_decay: float = 0)
+

To support training with MergedEmbeddingBag for good performance, optimizer step is fused with backward function.

+

Native usage for multiple EmbeddingBag is:

+
>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
+>>> sgd = torch.optim.SGD(EmbLists.parameters(), lr=lr, weight_decay=weight_decay)
+>>> inputs = [in1, in2, in3, ..., in_m]
+>>> outputs = []
+>>> for i in range(len(EmbLists)):
+>>>     outputs.append(EmbLists[i](inputs[i]))
+>>> sgd.zero_grad()
+>>> for i in range(len(outputs)):
+>>>     outputs[i].backward(grads[i])
+>>> sgd.step()
+
+
+

The optimized path is:

+
>>> # create MergedEmbeddingBagWithSGD module with optimizer args (lr and weight decay)
+>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
+>>> merged_emb = MergedEmbeddingBagWithSGD.from_embeddingbag_list(EmbLists, lr=lr, weight_decay=weight_decay)
+>>> # if you need to train with BF16 dtype, we provide split sgd on it
+>>> # merged_emb.to_bfloat16_train()
+>>> merged_input = merged_emb.linearize_indices_and_offsets(inputs)
+>>> outputs = merged_emb(merged_input, need_linearize_indices_and_offsets=torch.BoolTensor([False]))
+>>> outputs.backward(grads)
+
+
+

Training benefits further from this optimization:

+
+

1). Pytorch OP dispatching overhead in backward and weight update process is saved.

+

2). Thread loading becomes more balanced during backward/weight update. In real world scenarios, EmbeddingBag is often used to represent categorical features, and categorical features often follow a power law distribution. For example, if we use one embedding table to represent the age range of video game website users, we might find most of them are between 10-19 or 20-29. So we may need to update the rows which represent 10-19 or 20-29 frequently. Since updating these rows needs to write to the same memory address, the writes must be done by 1 thread (otherwise we will have write conflicts or overhead to resolve the conflicts). The potential memory write conflict can be simply addressed by merging multiple tables together.

+

3). The weight update is fused together with the backward pass. We can update the weights immediately after gradients are produced in the backward step, so the memory access pattern becomes more cache friendly: data access happens in cache more than in memory.

+
+
+ +

Auto kernel selection is a feature that enables users to tune for better performance with GEMM operations. We aim to provide good default performance by leveraging the best of math libraries and enabling weights_prepack. The feature was tested with a broad set of models. If you want to try other options, you can use the auto_kernel_selection toggle in ipex.optimize() to switch, and you can disable weights_prepack in ipex.optimize() if you are more concerned about the memory footprint than the performance gain. However, in most cases, we recommend sticking with the default settings for the best experience.
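If you do want to experiment with these knobs, here is a hedged sketch of the two toggles mentioned above (the tiny model is a placeholder; the defaults are usually the better choice):

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).eval()

# try a different kernel selection heuristic for GEMM-heavy models
model_tuned = ipex.optimize(model, auto_kernel_selection=True)

# trade some performance for a smaller memory footprint
model_lean = ipex.optimize(model, weights_prepack=False)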

+
+
+

Optimizer Optimization

+

Optimizers are one of key parts of the training workloads. Intel® Extension for PyTorch* brings two types of optimizations to optimizers:

+
  1. Operator fusion for the computation in the optimizers.

  2. SplitSGD for BF16 training, which reduces the memory footprint of the master weights by half.
+

For details, refer to Optimizer Fusion and Split SGD
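A minimal BF16 training sketch: passing the optimizer to ipex.optimize lets the extension return optimized versions of both, so fused optimizer kernels and SplitSGD master-weight handling can be applied where supported. The tiny model and random data are placeholders.

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(128, 10)
model.train()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# both the model and the optimizer are returned in their optimized form
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

data = torch.randn(32, 128)
target = torch.randint(10, (32,))

optimizer.zero_grad()
with torch.cpu.amp.autocast():
    loss = criterion(model(data), target)
loss.backward()
optimizer.step()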

+
+
+
+
+

Runtime Extension

+

Intel® Extension for PyTorch* Runtime Extension provides PyTorch frontend APIs for users to get finer-grained control of the thread runtime and provides:

+
  • Multi-stream inference via the Python frontend module MultiStreamModule.

  • Spawn asynchronous tasks from both Python and C++ frontend.

  • Program core bindings for OpenMP threads from both Python and C++ frontend.
+
+

Note

+

Intel® Extension for PyTorch* Runtime extension is still in the prototype stage. The API is subject to change. More detailed descriptions are available in the API Documentation.

+
+

For more detailed information, check Runtime Extension.
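As a rough sketch of multi-stream inference with the prototype API (subject to change; the stream count and NUMA node below are illustrative values):

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
data = torch.rand(16, 64)

with torch.no_grad():
    traced_model = torch.jit.freeze(torch.jit.trace(model, data))

# split inference over 2 streams bound to the cores of NUMA node 0
cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
multi_stream_model = ipex.cpu.runtime.MultiStreamModule(
    traced_model, num_streams=2, cpu_pool=cpu_pool
)

with torch.no_grad():
    y = multi_stream_model(data)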

+
+
+
+
+

INT8 Quantization

+

Intel® Extension for PyTorch* provides built-in quantization recipes to deliver good statistical accuracy for most popular DL workloads including CNN, NLP and recommendation models.

+

Users are recommended to always try quantization with the built-in quantization recipes first, using the Intel® Extension for PyTorch* quantization APIs. For even higher accuracy demands, users can try the separate recipe tuning APIs. These APIs are powered by Intel® Neural Compressor to take advantage of its tuning feature.

+

Smooth quantization (SmoothQuant) is a more recent post-training quantization (PTQ) solution which tackles the quantization error problem caused by systematic outliers in activations. SmoothQuant is commonly used for LLM quantization, and Intel® Extension for PyTorch* has provided built-in support for this solution.

+

Check more detailed information for INT8 Quantization and INT8 recipe tuning API guide (Prototype). In addition, SmoothQuant specific argument introduction and examples can be checked in SmoothQuant recipe tuning API guide (Prototype).

+
+
+
+
+

Codeless Optimization (Prototype, NEW feature from 1.13.0)

+

This feature enables users to get performance benefits from Intel® Extension for PyTorch* without changing Python scripts. It is intended to ease the usage and has been verified to work well with a broad scope of models, though in a few cases there could be a small overhead compared to applying optimizations with Intel® Extension for PyTorch* APIs.

+

For more detailed information, check Codeless Optimization.

+
+
+
+
+

Graph Capture (Prototype, NEW feature from 1.13.0)

+

Since graph mode is key for deployment performance, this feature automatically captures graphs based on the set of technologies that PyTorch supports, such as TorchScript and TorchDynamo. Users won’t need to learn and try different PyTorch APIs to capture graphs; instead, they can turn on a new boolean flag --graph_mode (default off) in ipex.optimize() to get the best of graph optimization.

+

For more detailed information, check Graph Capture.

+
+
+
+
+

HyperTune (Prototype, NEW feature from 1.13.0)

+

HyperTune is a prototype feature to perform hyperparameter/execution configuration searching. The searching is used in various areas such as optimization of hyperparameters of deep learning models. The searching is extremely useful in real situations when the number of hyperparameters, including the configuration of script execution, and their search spaces are so huge that manually tuning these hyperparameters/configurations is impractical and time consuming. HyperTune automates this process of execution configuration searching for the launcher and Intel® Extension for PyTorch*.

+

For more detailed information, check HyperTune.

+
+
+
+
+

Fast BERT Optimization (Prototype, NEW feature from 2.0.0)

+

Intel proposed a technique to speed up BERT workloads. Implementation is integrated into Intel® Extension for PyTorch*. An API ipex.fast_bert() is provided for a simple usage.

+

Currently the ipex.fast_bert API is well optimized for training tasks. It works for inference tasks too; however, please use the ipex.optimize API with graph mode to achieve peak performance.

+

For more detailed information, check Fast BERT.

+
+
+
+
diff --git a/cpu/2.4.0+cpu/tutorials/features/amp.html b/cpu/2.4.0+cpu/tutorials/features/amp.html

Auto Mixed Precision (AMP)

+
+

Introduction

+

torch.cpu.amp provides convenience for auto data type conversion at runtime. Deep learning workloads can benefit from lower-precision floating point data types such as torch.float16 or torch.bfloat16, because of their lighter calculation workload and smaller memory usage. Accuracy is sacrificed when using lower-precision floating point data types, so there’s a trade-off between accuracy and performance. Thus, some operations should use the slower but more accurate torch.float32, while others can be converted to use the faster but less accurate torch.float16 data type. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators.

+

torch.cpu.amp only supports torch.bfloat16. It is the default lower precision floating point data type when torch.cpu.amp is enabled. torch.cpu.amp primarily benefits when running on Intel CPU with BFloat16 instruction set support.

+

Prefer to use torch.cpu.amp.autocast() instead of torch.autocast(device_type="cpu").

+
+
+

Use Case

+

The following simple network should show a speedup with mixed precision.

+
class SimpleNet(torch.nn.Module):
+    def __init__(self):
+        super(SimpleNet, self).__init__()
+        self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)
+
+    def forward(self, x):
+        return self.conv(x)
+
+
+
+

Default Precision

+

Without torch.cpu.amp, the network executes all operators with default precision (torch.float32).

+
model = SimpleNet()
+x = torch.rand(64, 64, 224, 224)
+y = model(x)
+
+
+
+
+

Inference with Eager Path

+

torch.cpu.amp.autocast is designed to be a context manager that allows scopes of your script to run with mixed precision. In these scopes, operations run in a data type chosen by the autocast class to improve performance while maintaining accuracy. See the operations category section for details on what precision the autocast class chooses for each operator, and under what circumstances.

+
model = SimpleNet().eval()
+x = torch.rand(64, 64, 224, 224)
+with torch.cpu.amp.autocast():
+    y = model(x)
+
+
+
+
+

Inference with TorchScript Path

+

torch.cpu.amp.autocast can be used with torch.jit.trace to apply graph optimization. Due to PyTorch limitation, only torch.jit.trace is supported.

+
model = SimpleNet().eval()
+x = torch.rand(64, 64, 224, 224)
+with torch.cpu.amp.autocast():
+    model = torch.jit.trace(model, x)
+    model = torch.jit.freeze(model)
+    y = model(x)
+
+
+
+
+

Training Support

+

torch.cpu.amp.autocast can be used in training to improve performance.

+
model = SimpleNet()
+optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+for images, label in train_loader():
+    with torch.cpu.amp.autocast():
+        loss = criterion(model(images), label)
+    loss.backward()
+    optimizer.step()
+
+
+
+
+
+

Autocast Op Reference

+
+

Op Eligibility

+

Ops that run in float64 or non-floating-point dtypes are not eligible for mixed precision, and will run in these types whether or not autocast is enabled.

+

Only out-of-place ops and Tensor methods are eligible for mixed precision. In-place variants and calls that explicitly supply an out=... Tensor are allowed in autocast-enabled regions, but won’t go through autocasting. For example, in an autocast-enabled region a.addmm(b, c) can autocast, but a.addmm_(b, c) and a.addmm(b, c, out=d) cannot. For best performance and stability, use out-of-place ops in autocast-enabled regions.
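For instance (a small illustration of the rule above):

import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)
c = torch.randn(4, 4)
d = torch.empty(4, 4)

with torch.cpu.amp.autocast():
    out1 = a.mm(b)        # out-of-place: autocasts, runs in bfloat16
    out2 = a.addmm(b, c)  # out-of-place: autocasts, runs in bfloat16
    a.addmm_(b, c)        # in-place: allowed, but does not autocast (stays float32)
    a.addmm(b, c, out=d)  # explicit out= tensor: allowed, but does not autocast
print(out1.dtype, out2.dtype)  # torch.bfloat16 torch.bfloat16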

+
+
+

Op-Specific Behavior

+

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

+

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they’re downstream from autocasted ops.

+

If an op is unlisted, we assume it’s numerically stable in bfloat16. If you believe that an unlisted op is numerically unstable in bfloat16, file a GitHub issue.

+
+

Ops that can autocast to bfloat16

+

conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, bmm, mm, baddbmm, addmm, addbmm, linear, matmul, conv_tbc, group_norm, _native_multi_head_attention

+
+
+

Ops that can autocast to float32

+

avg_pool3d, binary_cross_entropy, grid_sampler, polar, prod, quantile, nanquantile, stft, cdist, trace, view_as_complex, cholesky, cholesky_inverse, cholesky_solve, inverse, lu_solve, matrix_rank, orgqr, ormqr, pinverse, max_unpool2d, max_unpool3d, adaptive_avg_pool3d, reflection_pad1d, reflection_pad2d, replication_pad1d, replication_pad2d, replication_pad3d, mse_loss, cosine_embedding_loss, nll_loss, nll_loss2d, hinge_embedding_loss, poisson_nll_loss, smooth_l1_loss, cross_entropy_loss, l1_loss, huber_loss, margin_ranking_loss, soft_margin_loss, triplet_margin_loss, multi_margin_loss, ctc_loss, kl_div, multilabel_margin_loss, binary_cross_entropy_with_logits, fft_fft, fft_ifft, fft_fft2, fft_ifft2, fft_fftn, fft_ifftn, fft_rfft, fft_irfft, fft_rfft2, fft_irfft2, fft_rfftn, fft_irfftn, fft_hfft, fft_ihfft, linalg_cond, linalg_matrix_rank, linalg_solve, linalg_cholesky, linalg_svdvals, linalg_eigvals, linalg_eigvalsh, linalg_inv, linalg_householder_product, linalg_tensorinv, linalg_tensorsolve, fake_quantize_per_tensor_affine, eig, geqrf, lstsq, _lu_with_info, qr, svd, symeig, triangular_solve, fractional_max_pool2d, fractional_max_pool3d, adaptive_max_pool3d, multilabel_margin_loss_forward, linalg_qr, linalg_cholesky_ex, linalg_svd, linalg_eig, linalg_eigh, linalg_lstsq, linalg_inv_ex

+
+
+

Ops that promote to the widest input type

+

These ops don’t require a particular dtype for stability, but take multiple inputs and require that the inputs’ dtypes match. If all of the inputs are bfloat16, the op runs in bfloat16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

+

cat, stack, index_copy

+

Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting’s intervention. If inputs are a mixture of bfloat16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.
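A short illustration of both behaviors:

import torch

f32 = torch.randn(2, 2)                         # float32
bf16 = torch.randn(2, 2, dtype=torch.bfloat16)  # bfloat16

with torch.cpu.amp.autocast():
    out = torch.cat([f32, bf16])  # listed op: inputs are cast to the widest type (float32)
    mixed = f32 + bf16            # unlisted binary op: native promotion, also float32
print(out.dtype, mixed.dtype)     # torch.float32 torch.float32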

+
+
+
+
diff --git a/cpu/2.4.0+cpu/tutorials/features/auto_channels_last.html b/cpu/2.4.0+cpu/tutorials/features/auto_channels_last.html

Auto Channels Last

+

Channels last memory format is known to have a performance advantage over channels first memory format. Refer to Channels Last for details. Intel® Extension for PyTorch* automatically converts the model to channels last memory format by default when users optimize their model with ipex.optimize(model).

+
+

Ease-of-use auto channels last API

+
+

default

+
model = ipex.optimize(model) # by default, model is channels last
+
+
+
+
+

enable

+
ipex.enable_auto_channels_last()
+model = ipex.optimize(model) # enable, model is channels last
+
+
+
+
+

disable

+
ipex.disable_auto_channels_last()
+model = ipex.optimize(model) # disable, model is channels first 
+
+
+
+
+
+

Known issue

+

For a broad range of models, the channels last memory format brings a performance boost over the channels first memory format. However, for a few use cases, it may bring a performance regression. If a performance regression is observed, we recommend feeding sample input data to ipex.optimize(model, sample_input=...).

+
model = ipex.optimize(model, sample_input=...)
+
+
+
+
diff --git a/cpu/2.4.0+cpu/tutorials/features/codeless_optimization.html b/cpu/2.4.0+cpu/tutorials/features/codeless_optimization.html

Codeless Optimization (Prototype)

+

This feature aims to get inference performance benefits from Intel® Extension for PyTorch* without changing code in your Python scripts, which can provide an out-of-box (OOB) experience to get started with Intel® Extension for PyTorch* easily. Users who already know how to apply optimizations with Intel® Extension for PyTorch* APIs are not the target of this feature, due to the inevitable overhead and limitations mentioned below.

+
+

Motivation

+

A typical inference use case, as in a transformer model, can be simplified to the code snippet below:

+
import torch
+model = Model().eval()
+with torch.no_grad():
+    for input in dataloader():
+        model(**input)
+
+
+

To utilize optimizations of Intel® Extension for PyTorch* for optimum performance, several lines of code changes are required/recommended.

+
import torch
+import intel_extension_for_pytorch as ipex # clause added
+model = Model().eval()
+model = ipex.optimize(model)              # clause added
+with torch.no_grad():
+  with torch.cpu.amp.autocast():          # clause added for running with BFloat16 (Optional)
+    input = ...                           # clause added for TorchScript (Optional, but recommended) 
+    model = torch.jit.trace(model, input) # clause added for TorchScript (Optional, but recommended) 
+    model = torch.jit.freeze(model)       # clause added for TorchScript (Optional, but recommended) 
+    for input in dataloader():
+      model(**input)
+
+
+

With this feature, the manual code changes above are no longer required. Intel® Extension for PyTorch* optimizations will be applied automatically during execution in a monkey-patch way.

+
  • Automatically import intel_extension_for_pytorch package: It applies Intel® Extension for PyTorch* optimizations, such as: torch.embedding_bag, torch.cpu.amp.autocast. It also registers Intel® Extension for PyTorch* JIT fusion pass and thus benefits the graph mode inference performance.

  • Automatically apply ipex.optimize() function. Only features enabled by default parameter values are supported, such as:

    • Auto generate FX or Jit Graph.

    • Auto Channel Last convert.

    • Conv-Bn folding.

    • Weight prepack.

    • Replace dropout with identity.

    • Optimize LSTM.

  • Automatically apply torch.cpu.amp.autocast with BFloat16 data type for inference.
+
+
+

Example Usage with HuggingFace

+

Let’s take the QA case in HuggingFace as an example.

+
+

The original command with ipex launch

+

Here is the command to run with ipexrun.

+
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
+
+
+
+
+

Command to apply ipex optimization for FP32

+

Added --auto-ipex

+
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
+
+
+
+
+

Command to apply ipex optimization for BF16

+

Added --auto-ipex --dtype bfloat16

+
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex --dtype bfloat16 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
+
+
+
+
+
+

Use Case not supported

+
+

Module uses forward method explicitly instead of the __call__ attr

+
import torch
+class DummyModule(torch.nn.Module):
+    def __init__(self,):
+        super(DummyModule, self).__init__()
+        self.input1 = torch.randn(1, 3, 224, 224)
+        self.conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
+        self.bn = torch.nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+
+    def forward(self, x):
+        return self.bn(self.conv(x))
+
+    def customized_forward(self, x):
+        return self.bn(self.conv(x))
+
+# a sample input for the illustration below
+input = torch.randn(1, 3, 224, 224)
+# Method 1 will succeed
+DummyModule()(input)
+# Method 2 will fail to apply ipex.optimize on the top-level model
+DummyModule().customized_forward(input)
+
+
+

If a model uses the forward method explicitly instead of the __call__ attr, we are unable to hook the execution of this model. As a result, we are unable to automatically apply the optimizations to this DummyModule().

+
+
+

Already using ipex.optimize

+

A user who already invokes ipex.optimize in the script is not targeted by this feature. The behaviour of repeated invocation of ipex.optimize is not defined. The second invocation of ipex.optimize for the same module will fail with an error message to avoid this behaviour.

+
+
+

Already using Jit Trace

+

The Jit trace case (as in the example code below) is not planned to be supported at the first stage:

+
import torch
+model = Model().eval()
+traced_model = torch.jit.trace(model, x).eval()
+traced_model = torch.jit.freeze(traced_model)
+with torch.no_grad():
+    for input in dataloader():
+        traced_model(input)
+
+
+

For 2 reasons:

+
  • The auto graph mode support has already been included in ipex.optimize with the graph-first API in 1.13.

  • Extra launch parameters and monkey patches are needed to support the above case. We will focus on the feasibility of the first use case in TorchVision and HuggingFace workloads.
+
+
+
diff --git a/cpu/2.4.0+cpu/tutorials/features/fast_bert.html b/cpu/2.4.0+cpu/tutorials/features/fast_bert.html

Fast BERT (Prototype)

+
+

Feature Description

+

Intel proposed a technique to speed up BERT workloads. Implementation leverages the idea from Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads.

+

Currently the ipex.fast_bert API is only well optimized for training. For inference, it ensures functionality; to get peak performance, please use the ipex.optimize API plus TorchScript.

+
+
+

Prerequisite

+
  • Transformers 4.6.0 ~ 4.43.2
+
+
+

Usage Example

+

An API ipex.fast_bert is provided for simple usage. Usage of this API follows the pattern of the ipex.optimize function. A more detailed description of the API is available in the Fast BERT API doc.

+
import torch
+from transformers import BertModel
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+torch.manual_seed(43)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.fast_bert(model, dtype=torch.bfloat16)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/graph_capture.html b/cpu/2.4.0+cpu/tutorials/features/graph_capture.html new file mode 100644 index 000000000..700dac8fb --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/graph_capture.html @@ -0,0 +1,201 @@ + + + + + + + Graph Capture (Prototype) — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

Graph Capture (Prototype)

+
+

Feature Description

+

This feature automatically applies a combination of the TorchScript trace technique and TorchDynamo to try to generate a graph model, providing a good user experience while keeping execution fast. Specifically, the process tries to generate a graph with the TorchScript trace functionality first. In case of generation failure or detected incorrect results, it switches to TorchDynamo with the TorchScript backend. Failure of graph generation with TorchDynamo triggers a warning message, and the generated graph model falls back to the original one, i.e. the inference workload runs in eager mode. Users can take advantage of this feature through the new graph_mode knob of the ipex.optimize() function to automatically run in graph mode.

+
+
+

Usage Example

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+#################### code changes ####################  # noqa F401
+import intel_extension_for_pytorch as ipex
+
+model = ipex.optimize(model, graph_mode=True)
+######################################################  # noqa F401
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/graph_optimization.html b/cpu/2.4.0+cpu/tutorials/features/graph_optimization.html new file mode 100644 index 000000000..19d48bba9 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/graph_optimization.html @@ -0,0 +1,413 @@ + + + + + + + Graph Optimization — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

Graph Optimization

+

Most deep learning models could be described as a DAG (directed acyclic graph). Optimizing a deep learning model from a graph perspective is straightforward. Compared to operator optimization and algorithm optimization, graph optimization is at a higher level. It covers not only the graph but also the runtime. From the operator perspective, graph optimization contains operator fusion and constant folding. From the runtime perspective, it contains operator scheduling, computation resource management, and memory management.

+

Intel® Extension for PyTorch* focuses on operator-related graph optimizations. The extension also provides some prototype features for the related runtime optimizations. Refer to the runtime extension for more details about runtime optimization.

+
+

Ease-of-use graph optimization API

+

The graph optimizations of Intel® Extension for PyTorch* are enabled by default. Users can disable it by calling:

+
ipex.enable_onednn_fusion(False)
+
+
+
+

FP32 and BF16 models

+
import torch
+import torchvision.models as models
+
+# Import the Intel Extension for PyTorch
+import intel_extension_for_pytorch as ipex
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+
+# Apply some fusions at the front end
+model = ipex.optimize(model, dtype=torch.float32)
+
+x = torch.randn(4, 3, 224, 224)
+with torch.no_grad():
+    model = torch.jit.trace(model, x, check_trace=False).eval()
+    # Fold the BatchNormalization and propagate constant
+    model = torch.jit.freeze(model)
+    # Print the graph
+    print(model.graph_for(x))
+
+print("Execution finished")
+
+
+

Compared to the original code, the model launcher needs to add a few lines of code, and the extension will automatically accelerate the model. Regarding ResNet-50, the extension will automatically fuse Conv + ReLU and Conv + Sum + ReLU as ConvReLU and ConvSumReLU. If you check the output of graph_for, you will observe the fused operators. A BF16 variant of the same workflow is sketched below.
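The heading above also covers BF16; the following is a minimal sketch of the BF16 path under the same workflow, assuming a CPU with native BF16 support. It is illustrative only; the autocast context is used during tracing so that BF16 kernels are captured in the graph.

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()

# Apply frontend optimizations with BF16 as the target dtype
model = ipex.optimize(model, dtype=torch.bfloat16)

x = torch.randn(4, 3, 224, 224)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, x, check_trace=False).eval()
    model = torch.jit.freeze(model)
    print(model.graph_for(x))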

+
+
+

INT8 models

+
import torch
+import torchvision.models as models
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+# construct the model
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+qconfig = ipex.quantization.default_static_qconfig
+model.eval()
+example_inputs = torch.rand(1, 3, 224, 224)
+prepared_model = prepare(model, qconfig, example_inputs=example_inputs, inplace=False)
+
+##### Example Dataloader #####  # noqa F401
+import torchvision
+
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+calibration_data_loader = torch.utils.data.DataLoader(
+    dataset=train_dataset, batch_size=128
+)
+
+with torch.no_grad():
+    for batch_idx, (d, target) in enumerate(calibration_data_loader):
+        print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}")
+        prepared_model(d)
+##############################  # noqa F401
+
+convert_model = convert(prepared_model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(convert_model, example_inputs)
+    traced_model = torch.jit.freeze(traced_model)
+
+traced_model.save("quantized_model.pt")
+
+# Deployment
+quantized_model = torch.jit.load("quantized_model.pt")
+quantized_model = torch.jit.freeze(quantized_model.eval())
+images = torch.rand(1, 3, 224, 224)
+with torch.no_grad():
+    output = quantized_model(images)
+
+print("Execution finished")
+
+
+
+
+
+

Methodology

+
+

Fusion

+
+

FP32 and BF16 fusion patterns

+
    +
  • Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Abs/Clamp/Elu/Exp/GELU/HardTanh/HardSwish/Log/Mish/Sigmoid/Pow/ReLU/Round/Sqrt/Square/Tanh/Leaky_ReLU/SiLU

  • +
  • Conv1D/Conv2D/Conv3D/Linear/ConvTranspose2D/ConvTranspose3D + Sigmoid + MUL

  • +
  • Conv1D/Conv2D/Conv3D/Linear + SUM

  • +
  • Conv1D/Conv2D/Conv3D + SUM + ReLU

  • +
  • Add + LayerNorm

  • +
  • Div + Add + Softmax

  • +
  • Linear + Linear + Linear

  • +
  • View + Transpose + Contiguous + View

  • +
+
+
+

INT8 fusion patterns

+

The ipex.quantization.convert(model, conf, inputs) API will convert an FP32 torch.nn.Module to a quantized JIT ScriptModule according to the given quantization recipes.

+

For example, for an FP32 model with a single convolution, the graph before and after conversion is shown in the following figure: image

+

The oneDNN graph backend will select the dequantize and convolution ops into one partition. During execution, this partition will execute a convolution with INT8 input and FP32 output.

+

Here are all the INT8 patterns currently supported in Intel® Extension for PyTorch* with the oneDNN graph backend:

+
    +
  1. Conv/Linear/Matmul related fusion patterns

    +
                                             |
    +                                     [Quantize]*
    +                |                        |
    +           Dequantize                Dequantize
    +                \                      /
    +           Conv1D/Conv2D/Conv3D/Linear/MatMul
    +                             |
    +         [Abs/Elu/GELU/HardTanh/Leaky_ReLU/Sigmoid/
    +    ReLU/Sqrt/Square/Tanh/[Dequantize+Add]*[0,1] ]*[0,3]
    +                             |
    +                         [Quantize]*
    +                             |
    +
    +
    +
         |              |
    +   Dequantize   Dequantize
    +      \___      ___/
    +          MatMul
    +             \    /
    +             Divide
    +                \   /
    +                [Add]*
    +                  |
    +
    +
    +
  2. Non-Conv/Linear/Matmul related fusion patterns

    +
               |
    +       Dequantize
    +           |
    +       MaxPool2D
    +           |
    +        Quantize
    +
    +
    +
  3. INT8-BF16 mixed-precision fusion patterns

    +
         |              |
    +   Dequantize   Dequantize
    +     |              |
    +    To             To
    +      \___      ___/
    +          MatMul
    +             \      /
    +             [Divide]*
    +                 \     /
    +                  [Add]*
    +                    |
    +
    +
    +
         |              |
    +   Dequantize   Dequantize
    +     |              |
    +    To             To
    +      \___      ___/
    +          MatMul
    +            |
    +          [GeLU]*
    +            |
    +           To
    +            |
    +         Quantize
    +            |
    +
    +
    +
         |              |
    +   Dequantize   Dequantize
    +     |              |
    +     To            To     Dequantize
    +      \___      ___/          |
    +          MatMul              To
    +             \_____        ___/
    +                    [Add]*
    +                      |
    +
    +
    +
+
+
+
+

Folding

+

Stock PyTorch provides constant propagation and BatchNormalization folding. These optimizations are automatically applied to the JIT model by invoking torch.jit.freeze. Take ResNet-50 as an example:

+
import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+x = torch.randn(4, 3, 224, 224)
+
+with torch.no_grad():
+    model = torch.jit.trace(model, x, check_trace=False).eval()
+    # Fold the BatchNormalization and propagate constant
+    model = torch.jit.freeze(model)
+    # Print the graph
+    print(model.graph_for(x))
+
+print("Execution finished")
+
+
+

If the model owner does not invoke torch.jit.freeze, the BatchNormalization ops still exist in the graph. Otherwise, the BatchNormalization ops will be folded in the graph to save computation and thus improve performance. Refer to the Constant Folding Wikipedia page for more details. A small check for this folding is sketched below.
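As an illustrative check (a sketch, not part of the original example), one can count the batch_norm nodes in the traced graph before and after freezing; for an eval-mode ResNet-50 the frozen graph is expected to contain none.

import torch
import torchvision.models as models

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    traced = torch.jit.trace(model, x, check_trace=False).eval()
    frozen = torch.jit.freeze(traced)
    # Count aten::batch_norm nodes in both graphs.
    print(str(traced.graph_for(x)).count("batch_norm"))   # > 0 before folding
    print(str(frozen.graph_for(x)).count("batch_norm"))   # expected to be 0 after folding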

+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/hypertune.html b/cpu/2.4.0+cpu/tutorials/features/hypertune.html new file mode 100644 index 000000000..3cf4adfec --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/hypertune.html @@ -0,0 +1,351 @@ + + + + + + + HyperTune (Prototype) — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

HyperTune (Prototype)

+

HyperTune

+

HyperTune is a prototype feature that performs hyperparameter/execution configuration searching. Such searching is used in various areas, such as optimization of hyperparameters of deep learning models. It is extremely useful in real situations where the number of hyperparameters, including the configuration of script execution, and their search spaces are so large that manually tuning them is impractical and time consuming. HyperTune automates this process of execution configuration searching for the launcher and Intel® Extension for PyTorch*.

+
+

Usage of Hypertune

+
python -m intel_extension_for_pytorch.cpu.hypertune --conf-file <your_conf_file> <your_python_script> [args]
+
+
+

There are two things to provide to HyperTune: (1) a <your_conf_file> .yaml file to define the hyperparameters and their search spaces, and (2) <your_python_script> as an optimization function.

+
+

your_conf_file

+

The .yaml file defines the configuration of HyperTune. Two main pieces of information are needed: (1) the hyperparameters to tune and their search spaces, and (2) the tuning strategy. See the comments in the sample .yaml file below.

+
tuning:                                                        # optional.
+  strategy: grid                                               # optional. The tuning strategy. Default is grid. Must be one of {grid, random}.
+  max_trials: 100                                              # optional. Allowed number of trials. Default is 100. If given time, set max_trials to product of length of all search spaces to try all possible combinations of hyperparameters.
+
+output_dir: /path/to/saving/directory                          # optional. Directory to which the tuning history will be saved in record.csv file. Default is current working directory.
+
+hyperparams:                                                   # mandatory.
+  launcher:                                                    # optional.
+    hp: ['ncores_per_instance', 'ninstances']                  # mandatory. Mandatory if hyperparams.launcher is specified. Specify the launcher hyperparameters to tune.
+    ncores_per_instance: all_physical_cores                    # optional.  Search space of ncores_per_instance if chosen to tune. If not defined, default search space of ncore_per_instance is used.
+    ninstances:  [1]                                           # optional.  Search space of ninstances if chosen to tune. If not defined, default search space of ninstances is used.
+
+
+
+
+

Hyperparameters

+
+

Launcher Hyperparameters

+

Currently hypertune tunes for the following launcher hyperparameters:

| hyperparameter | default value | default search space | search space format |
| ---- | ---- | ---- | ---- |
| ncores_per_instance | -1 | all_logical_cores | str or list of int. str must be one of {'all_logical_cores', 'all_physical_cores'} |
| ninstances | -1 | all_logical_cores | str or list of int. str must be one of {'all_logical_cores', 'all_physical_cores'} |
| use_all_nodes | True | [True, False] if num_nodes > 1 else [True] | list of bool |
| use_logical_cores | False | [True, False] if is_hyperthreading_enabled else [False] | list of bool |
| disable_numactl | False | [True, False] | list of bool |
| disable_iomp | False | [True, False] | list of bool |
| malloc | tc | ['tc', 'je', 'pt'] | list of str. str must be in {'tc', 'je', 'pt'} |
+
+
+

Defining hyperparameters and their search spaces

+
+

1. Defining hyperparameters to tune:

+

List the hyperparameters to tune in hp. For example, to tune all launcher hyperparameters:

+
hyperparams:
+  launcher:
+    hp: ['ncores_per_instance', 'ninstances', 'use_all_nodes', 'use_logical_cores', 'disable_numactl', 'disable_iomp', 'malloc']
+
+
+

For example, to tune only launcher ncores_per_instance:

+
hyperparams:
+  launcher:
+    hp: ['ncores_per_instance']
+
+
+

All other launcher hyperparameters (ninstances, use_all_nodes, use_logical_cores, disable_numactl, disable_iomp, malloc) will not be tuned and will instead use the default values defined in the previous section.

+
+
+

2. Defining the search spaces of the hyperparameters:

+
+
+

Default search space

+

If you don't specify the search space of a hyperparameter, the default search space defined in the previous section will be used for the hyperparameters defined in hp. For example,

+
hyperparams:
+  launcher:
+    hp: ['malloc']
+
+
+

malloc will be tuned using its default search space, ['tc', 'je', 'pt']. All other launcher hyperparameters (ncores_per_instance, ninstances, use_all_nodes, use_logical_cores, disable_numactl, disable_iomp) will not be tuned and will instead use their default values.

+
+
+

User defined search space

+

Specify the search space of a hyperparameter. For example,

+
hyperparams:
+  launcher:
+    hp: ['ncores_per_instance', 'ninstances', 'malloc']
+    ninstances: [1]
+    ncores_per_instance: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+
+

ninstances and ncores_per_instance will use user defined spaces [1] and [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] respectively. malloc will use its default search space, ['tc', 'je', 'pt'].

+
+
+
+

<your_python_script>

+

This is the script as an optimization function.

+
    +
  • Step 1. Print the objective(s) you want to optimize. Make sure this is just an int or float to be minimized or maximized.

  • +
  • Step 2. Just before the objective(s), add print statement(s) of the @hypertune {'name': str, 'higher_is_better': bool, 'target_val': int or float}.

  • +
+
'name'                                     # mandatory. The name of your objective function.
+'higher_is_better'                         # optional. True if objective function is to be maximized, False if to be minimized. Default is False.
+'target_val'                               # optional. Target value of the objective function. Default is -float('inf')
+
+
+

Have a look at the example script.
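For illustration only, a minimal objective script could look like the following; the latency measurement and the objective name are assumptions, not the shipped example.

import time
import torch
import torchvision.models as models

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    start = time.time()
    model(data)
    latency = time.time() - start

# Step 2: declare the objective right before printing its value.
print("@hypertune {'name': 'latency', 'higher_is_better': False}")
# Step 1: print the objective value itself (an int or float).
print(latency)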

+
+
+
+

Usage Examples

+

Tuning ncores_per_instance for minimum latency

+

Suppose we want to tune ncores_per_instance for a single instance to minimize latency for resnet50 on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. Each socket has 28 physical cores and another 28 logical cores.

+

Run the following command with example.yaml and resnet50.py:

+
python -m intel_extension_for_pytorch.cpu.hypertune --conf_file <hypertune_directory>/example/example.yaml <hypertune_directory>/example/resnet50.py
+
+
+

Once the search completes, the best tuning result and the best configuration found are printed to the terminal. Below is an example output for this case:

+
Best configuration found is: {'ncores_per_instance': 15, 'ninstances': 1, 'use_all_nodes': True, 'use_logical_cores': False, 'disable_numactl': False, 'disable_iomp': False, 'malloc': 'tc'}
+latency: 12.339081764221191
+
+
+

15 ncores_per_instance gave the minimum latency.

+

You will also find the tuning history in <output_dir>/record.csv. You can take a sample csv file as a reference.
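To inspect the tuning history programmatically, a short sketch like the following could be used; the exact column names in record.csv (here a 'latency' objective column) are assumptions and should be adjusted to your run.

import pandas as pd

history = pd.read_csv("record.csv")
# Sort by the objective column and show the best configurations first.
print(history.sort_values("latency").head())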

+

HyperTune can also optimize multi-objective functions. Add as many objectives as you would like to your script.

+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/int8_overview.html b/cpu/2.4.0+cpu/tutorials/features/int8_overview.html new file mode 100644 index 000000000..3150b86c0 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/int8_overview.html @@ -0,0 +1,321 @@ + + + + + + + Intel® Extension for PyTorch* optimizations for quantization — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

Intel® Extension for PyTorch* optimizations for quantization

+

The quantization functionality in Intel® Extension for PyTorch* currently only supports post-training quantization. This tutorial introduces how quantization works on the Intel® Extension for PyTorch* side.

+

We reuse PyTorch quantization components as much as possible, such as the PyTorch observer methods. To make it easy for PyTorch users, the quantization API in Intel® Extension for PyTorch* is very similar to that of PyTorch. Intel® Extension for PyTorch* quantization supports a default recipe to automatically decide which operators should be quantized or not, which provides a satisfying performance and accuracy tradeoff.

+
+

Static Quantization

+
import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+
+
+

Define qconfig

+

Using the default qconfig(recommended):

+
qconfig = ipex.quantization.default_static_qconfig
+# equal to
+# QConfig(activation=HistogramObserver.with_args(reduce_range=False),
+#         weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) 
+
+
+

or define your own qconfig as:

+
from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig
+qconfig = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
+                  weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+
+
+

Note: we fully reuse the PyTorch observer methods, so you can use a different PyTorch observer method to define the QConfig. For the weight observer, only the torch.qint8 dtype is supported now.

+

Suggestion:

+
    +
  1. For the activation observer, if the qscheme is torch.per_tensor_affine, torch.quint8 is preferred; if the qscheme is torch.per_tensor_symmetric, torch.qint8 is preferred. For the weight observer, setting qscheme to torch.per_channel_symmetric can give better accuracy.

  2. If your CPU (e.g., Skylake) doesn't support VNNI, setting the observer's reduce_range to True can give better accuracy (see the sketch after this list).
+
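As an illustrative sketch (an assumption-based variant, not a required recipe), a custom static qconfig with reduce_range enabled for a CPU without VNNI could be defined as:

import torch
from torch.ao.quantization import HistogramObserver, PerChannelMinMaxObserver, QConfig

# reduce_range=True restricts activations to 7 bits, which helps accuracy on pre-VNNI CPUs.
qconfig_reduce_range = QConfig(
    activation=HistogramObserver.with_args(reduce_range=True),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)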
+
+

Prepare Model and Do Calibration

+
# prepare model, do conv+bn folding, and init model quant_state.
+user_model = ...
+user_model.eval()
+example_inputs = ..
+prepared_model = prepare(user_model, qconfig, example_inputs=example_inputs, inplace=False)
+
+for x in calibration_data_set:
+    prepared_model(x)
+
+# Optional: if you want to tune (for performance or accuracy), you can save the qparams as a JSON file,
+# which includes the quantization state such as scales, zero points and inference dtype.
+# You can then change the JSON file's settings and load the changed JSON file into the model,
+# which will override the model's original quantization settings.
+#  
+# prepared_model.save_qconf_summary(qconf_summary = "configure.json")
+# prepared_model.load_qconf_summary(qconf_summary = "configure.json")
+
+
+
+
+

Convert to Static Quantized Model and Deploy

+
# make sure the example_inputs' size is the same as the real input's size
+convert_model = convert(prepared_model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(convert_model, example_inputs)
+    traced_model = torch.jit.freeze(traced_model)
+# for inference 
+y = traced_model(x)
+
+# or save the model to deploy
+
+# traced_model.save("quantized_model.pt")
+# quantized_model = torch.jit.load("quantized_model.pt")
+# quantized_model = torch.jit.freeze(quantized_model.eval())
+# ...
+
+
+
+
+
+

Dynamic Quantization

+
import intel_extension_for_pytorch as ipex
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+
+
+

Define QConfig

+

Using the default qconfig(recommended):

+
dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
+# equal to 
+# QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
+#         weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))
+
+
+

or define your own qconfig as:

+
from torch.ao.quantization import MinMaxObserver, PlaceholderObserver, QConfig
+dynamic_qconfig = QConfig(activation = PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
+                          weight = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))
+
+
+

Note: For weight observer, it only supports dtype torch.qint8, and the qscheme can only be torch.per_tensor_symmetric or torch.per_channel_symmetric. For activation observer, it only supports dtype torch.float, and the compute_dtype can be torch.quint8 or torch.qint8.

+

Suggestion:

+
    +
  1. For the weight observer, setting qscheme to torch.per_channel_symmetric can give better accuracy.

  2. If your CPU (e.g., Skylake) doesn't support VNNI, setting the observer's reduce_range to True can give better accuracy.
+
+
+

Prepare Model

+
prepared_model = prepare(user_model, dynamic_qconfig, example_inputs=example_inputs)
+
+
+
+
+
+

Convert to Dynamic Quantized Model and Deploy

+
# make sure the example_inputs' size is the same as the real input's size
+convert_model = convert(prepared_model)
+# Optional: convert the model to traced model
+#with torch.no_grad():
+#    traced_model = torch.jit.trace(convert_model, example_inputs)
+#    traced_model = torch.jit.freeze(traced_model)
+
+# or save the model to deploy
+# traced_model.save("quantized_model.pt")
+# quantized_model = torch.jit.load("quantized_model.pt")
+# quantized_model = torch.jit.freeze(quantized_model.eval())
+# ...
+# for inference 
+y = convert_model(x)
+
+
+

Note: only the following ops are supported for dynamic quantization (a minimal sketch follows the list):

+
    +
  • torch.nn.Linear

  • +
  • torch.nn.LSTM

  • +
  • torch.nn.GRU

  • +
  • torch.nn.LSTMCell

  • +
  • torch.nn.RNNCell

  • +
  • torch.nn.GRUCell

  • +
+
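Below is a minimal sketch of dynamically quantizing one of these modules (an nn.LSTM) with the default dynamic qconfig; the tensor sizes are illustrative assumptions only.

import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = torch.nn.LSTM(input_size=32, hidden_size=64, num_layers=1).eval()
example_inputs = torch.randn(25, 1, 32)  # (seq_len, batch, input_size)

dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
prepared_model = prepare(model, dynamic_qconfig, example_inputs=example_inputs)
convert_model = convert(prepared_model)

with torch.no_grad():
    output, (hn, cn) = convert_model(example_inputs)
print(output.shape)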
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/int8_recipe_tuning_api.html b/cpu/2.4.0+cpu/tutorials/features/int8_recipe_tuning_api.html new file mode 100644 index 000000000..d9f8879a4 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/int8_recipe_tuning_api.html @@ -0,0 +1,411 @@ + + + + + + + INT8 Recipe Tuning API (Prototype) — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

INT8 Recipe Tuning API (Prototype)

+

This new API ipex.quantization.autotune supports INT8 recipe tuning by using Intel® Neural Compressor as the backend in Intel® Extension for PyTorch*. In general, we provide a default recipe in Intel® Extension for PyTorch*, and we recommend users try the default recipe first without bothering with tuning. If the default recipe doesn't achieve the desired accuracy, users can use this API to tune for a more advanced recipe.

+

Users need to provide an FP32 model and some parameters required for tuning. The API will return a prepared model with the tuned qconfig loaded.

+
+

Usage Example

+ +
+
+

Smooth Quantization Autotune

+
+

Algorithm: Auto-tuning of $\alpha$.

+

The SmoothQuant method aims to split the quantization difficulty of weights and activations by using a fixed-value $\alpha$ for an entire model. However, as the distributions of activation outliers vary not only across different models but also across different layers within a model, we hereby propose a method to obtain layer-wise optimal $\alpha$ values with the ability to tune automatically. Currently, both layer-wise and block-wise auto-tuning methods are supported, and the default option is layer-wise. In block-wise auto-tuning, layers within one block (e.g. an OPTDecoderLayer) share the same alpha value; users can set 'do_blockwise': True in auto_alpha_args to enable it.

+

Our proposed method consists of 8 major steps:

+
    +
  • Hook input minimum and maximum values of layers to be smoothed using register_forward_hook.

  • +
  • Find a list of layers on which smoothquant could be performed.

  • +
  • Generate a list of $\alpha$ values of a user-defined range and set a default $\alpha$ value.

  • +
  • Calculate smoothing factor using default $\alpha$ value, adjust parameters accordingly and forward the adjusted model given an input sample.

  • +
  • Perform per-channel quantization_dequantization of weights and per-tensor quantization_dequantization of activations to predict output.

  • +
  • Calculate the layer-wise/block-wise loss with respect to FP32 output, iterate the previous two steps given each $\alpha$ value and save the layer-wise/block-wise loss per alpha.

  • +
  • Apply criterion on input LayerNorm op and obtain the optimal alpha values of a single input sample.

  • +
  • Iterate the previous three steps over a number of input samples and save the layer-wise/block-wise optimal $\alpha$ values.

  • +
+

Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha$ value of an input LayerNorm op of a transformer block. Both alpha range and criterion could be configured in auto_alpha_args.

+

In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well-balanced one for the majority of models.

+
+
+

$\alpha$ Usage

+

There are two ways to apply smooth quantization: 1) using a fixed alpha for the entire model or 2) determining the alpha through auto-tuning.

+
+

Using a fixed alpha

+

To set a fixed alpha for the entire model, users can follow this example:

+
import intel_extension_for_pytorch as ipex
+smoothquant_args = {"alpha": 0.5, "folding": True}
+tuned_model = ipex.quantization.autotune(
+    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
+)
+
+
+

smoothquant_args description: +“alpha”: a float value. Default is 0.5. +“folding”: whether to fold mul into the previous layer, where mul is required to update the input distribution during smoothing.

+
    +
  • True: Fold inserted mul into the previous layer in the model graph. IPEX will only insert mul for layers that can do folding.

  • +
  • False: Allow inserting mul to update the input distribution without folding in the graph explicitly. IPEX (version>=2.1) will fuse inserted mul automatically in the backend.

  • +
+
+
+

Determining the alpha through auto-tuning

+

Users can search for the best alpha at two levels: a) for the entire model, and b) for each layer/block.

+
    +
  1. Auto-tune the alpha for the entire model +The tuning process looks for the optimal alpha value from a list of alpha values provided by the user.

  2. +
+
+

Please note that this may take a considerable amount of time, as the tuning process applies each alpha to the entire model and uses the evaluation result on the entire dataset as the metric to determine the best alpha. Here is an example:

+
+
import numpy as np
+smoothquant_args={"alpha": np.arange(0.0, 1.0, 0.1).tolist()}
+
+
+
    +
  1. Auto-tune the alpha for each layer/block +In this case, the tuning process searches the optimal alpha of each layer of the block by evaluating the loss with respect to FP32 output on a few batches of data. +Here is an example:

  2. +
+
smoothquant_args={
+    "alpha": "auto",
+    "auto_alpha_args"{
+        "init_alpha": 0.8, # baseline alpha-value for auto-tuning
+        "alpha_min": 0.8, # min value of auto-tuning alpha search space
+        "alpha_max": 0.99, # max value of auto-tuning alpha search space
+        "alpha_step": 0.01, # step_size of auto-tuning alpha search space
+        "shared_criterion": "mean", # criterion for input LayerNorm op of a transformer block
+        "enable_blockwise_loss": False, # whether to enable block-wise auto-tuning
+    }
+}
+
+
+
import torch
+from torch import nn
+from torch.utils.data import DataLoader
+from torchvision import datasets
+from torchvision.transforms import ToTensor
+
+import intel_extension_for_pytorch as ipex
+
+########################################################################  # noqa F401
+# Reference for training portion:
+# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
+
+# Download training data from open datasets.
+training_data = datasets.FashionMNIST(
+    root="data",
+    train=True,
+    download=True,
+    transform=ToTensor(),
+)
+
+# Download test data from open datasets.
+test_data = datasets.FashionMNIST(
+    root="data",
+    train=False,
+    download=True,
+    transform=ToTensor(),
+)
+batch_size = 64
+
+# Create data loaders.
+train_dataloader = DataLoader(training_data, batch_size=batch_size)
+test_dataloader = DataLoader(test_data, batch_size=1)
+
+for X, y in test_dataloader:
+    print(f"Shape of X [N, C, H, W]: {X.shape}")
+    print(f"Shape of y: {y.shape} {y.dtype}")
+    break
+
+
+# Define model
+class NeuralNetwork(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.flatten = nn.Flatten()
+        self.linear_relu_stack = nn.Sequential(
+            nn.Linear(28 * 28, 512),
+            nn.ReLU(),
+            nn.Linear(512, 512),
+            nn.ReLU(),
+            nn.Linear(512, 10),
+        )
+
+    def forward(self, x):
+        x = self.flatten(x)
+        logits = self.linear_relu_stack(x)
+        return logits
+
+
+model = NeuralNetwork()
+loss_fn = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
+
+
+def train(dataloader, model, loss_fn, optimizer):
+    size = len(dataloader.dataset)
+    model.train()
+    for batch, (X, y) in enumerate(dataloader):
+
+        # Compute prediction error
+        pred = model(X)
+        loss = loss_fn(pred, y)
+
+        # Backpropagation
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+
+        if batch % 100 == 0:
+            loss, current = loss.item(), batch * len(X)
+            print(f"loss: {loss:>7f}    [{current:>5d}/{size:>5d}]")
+
+
+model, optimizer = ipex.optimize(model, optimizer=optimizer)
+
+epochs = 5
+for t in range(epochs):
+    print(f"Epoch {t+1}\n-------------------------------")
+    train(train_dataloader, model, loss_fn, optimizer)
+print("Done!")
+
+################################ QUANTIZE ##############################  # noqa F401
+model.eval()
+
+
+def evaluate(dataloader, model):
+    size = len(dataloader.dataset)
+    model.eval()
+    accuracy = 0
+    with torch.no_grad():
+        for X, y in dataloader:
+            # X, y = X.to('cpu'), y.to('cpu')
+            pred = model(X)
+            accuracy += (pred.argmax(1) == y).type(torch.float).sum().item()
+    accuracy /= size
+    return accuracy
+
+
+######################## recipe tuning with INC ########################  # noqa F401
+def eval(prepared_model):
+    accu = evaluate(test_dataloader, prepared_model)
+    return float(accu)
+
+
+tuned_model = ipex.quantization.autotune(
+    model,
+    test_dataloader,
+    eval_func=eval,
+    sampling_sizes=[100],
+    accuracy_criterion={"relative": 0.01},
+    tuning_time=0,
+)
+########################################################################  # noqa F401
+
+# run tuned model
+data = torch.randn(1, 1, 28, 28)
+convert_model = ipex.quantization.convert(tuned_model)
+with torch.no_grad():
+    traced_model = torch.jit.trace(convert_model, data)
+    traced_model = torch.jit.freeze(traced_model)
+    traced_model(data)
+
+# save tuned qconfig file
+tuned_model.save_qconf_summary(qconf_summary="tuned_conf.json")
+
+print("Execution finished")
+
+
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/isa_dynamic_dispatch.html b/cpu/2.4.0+cpu/tutorials/features/isa_dynamic_dispatch.html new file mode 100644 index 000000000..842a2a59d --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/isa_dynamic_dispatch.html @@ -0,0 +1,763 @@ + + + + + + + ISA Dynamic Dispatching — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +

ISA Dynamic Dispatching

+

This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.

+
+

Overview

+

IPEX dyndisp is forked from PyTorch: ATen/native/DispatchStub.h and ATen/native/DispatchStub.cpp. IPEX adds additional CPU ISA level support, such as AVX512_VNNI, AVX512_BF16 and AMX.

+

PyTorch & IPEX CPU ISA support statement:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DEFAULTAVX2AVX2_VNNIAVX512AVX512_VNNIAVX512_BF16AMXAVX512_FP16
PyTorch
IPEX-1.11
IPEX-1.12
IPEX-1.13
IPEX-2.1
IPEX-2.2
IPEX-2.3

* The current IPEX DEFAULT level is implemented the same as the AVX2 level.

+
+

CPU ISA build compiler requirement

+

| ISA Level | GCC requirement |
| ---- | ---- |
| AVX2 | Any |
| AVX512 | GCC 9.2+ |
| AVX512_VNNI | GCC 9.2+ |
| AVX512_BF16 | GCC 10.3+ |
| AVX2_VNNI | GCC 11.2+ |
| AMX | GCC 11.2+ |
| AVX512_FP16 | GCC 12.1+ |

+

* Check with cmake/Modules/FindAVX.cmake for detailed compiler checks.

+
+
+
+

Dynamic Dispatch Design

+

Dynamic dispatch copies the kernel implementation source files to multiple folders for each ISA level. It then builds each file using its ISA specific parameters. Each generated object file will contain its function body (Kernel Implementation).

+

Kernel Implementation uses an anonymous namespace so that different CPU versions won’t conflict.

+

Kernel Stub is a “virtual function” with polymorphic kernel implementations pertaining to ISA levels.

+

At runtime, the Dispatch Stub implementation checks CPUIDs and OS status to determine which ISA level pointer best matches the function body.

+
+

Code Folder Structure

+
+

Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp

+
+
+

Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h

+
+
+

Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h

+
+
+
+

CodeGen Process

+

The IPEX build system generates code for each ISA level with specific compiler parameters. The CodeGen script is located at cmake/cpu/IsaCodegen.cmake.

+

The CodeGen copies each cpp file from the kernel implementation, and then adds the ISA level as a new file suffix.

+
+

Sample:

+
+

Origin file:

+

csrc/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp

+

Generate files:

+

DEFAULT: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.DEFAULT.cpp -O3 -D__AVX__ -DCPU_CAPABILITY_AVX2 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=DEFAULT -DCPU_CAPABILITY_DEFAULT

+

AVX2: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX2.cpp -O3 -D__AVX__ -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2

+

AVX512: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512.cpp -O3 -D__AVX512F__ -mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512

+

AVX512_VNNI: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_VNNI.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mfma -DCPU_CAPABILITY=AVX512_VNNI -DCPU_CAPABILITY_AVX512_VNNI

+

AVX512_BF16: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16

+

AMX: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3  -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX

+

AVX512_FP16: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_FP16.cpp -O3  -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -mavx512fp16 -DCPU_CAPABILITY_AMX -DCPU_CAPABILITY=AVX512_FP16 -DCPU_CAPABILITY_AVX512_FP16

+
+
+
+

Note:

+
    +
  1. DEFAULT level kernels are not fully implemented in IPEX. In order to align with PyTorch, the DEFAULT level is built with AVX2 parameters instead. So, the minimal requirement for machines executing IPEX is AVX2 support.

  2. -D__AVX__ and -D__AVX512F__ are defined for the dependent library sleef.

  3. -DCPU_CAPABILITY_AVX512 and -DCPU_CAPABILITY_AVX2 must be defined for PyTorch: aten/src/ATen/cpu/vec; they determine the vec register width.

  4. -DCPU_CAPABILITY=[ISA_NAME] must also be defined for PyTorch: aten/src/ATen/cpu/vec; it is used as the inline namespace name.

  5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI. But AVX512 does not contain AVX2 definitions, because they have different vec register widths.
+
+
+
+
+

Add Custom Kernel

+

If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:

+
    +
  1. Add the CPU ISA related kernel implementation to the folder: csrc/cpu/aten/kernels/NewKernelKrnl.cpp

  2. Add the kernel stub to the folder: csrc/cpu/aten/NewKernel.cpp

  3. Include the header file csrc/cpu/dyndisp/DispatchStub.h, and refer to the comment in the header file.
+
// Implements instruction set specific function dispatch.
+//
+// Kernels that may make use of specialized instruction sets (e.g. AVX2) are
+// compiled multiple times with different compiler flags (e.g. -mavx2). A
+// DispatchStub contains a table of function pointers for a kernel. At runtime,
+// the fastest available kernel is chosen based on the features reported by
+// cpuinfo.
+//
+// Example:
+//
+// In csrc/cpu/aten/MyKernel.h:
+//   using fn_type = void(*)(const Tensor& x);
+//   IPEX_DECLARE_DISPATCH(fn_type, stub);
+//
+// In csrc/cpu/aten/MyKernel.cpp
+//   IPEX_DEFINE_DISPATCH(stub);
+//
+// In csrc/cpu/aten/kernels/MyKernel.cpp:
+//   namespace {
+//     // use anonymous namespace so that different cpu versions won't conflict
+//     void kernel(const Tensor& x) { ... }
+//   }
+//   IPEX_REGISTER_DISPATCH(stub, &kernel);
+//
+// To call:
+//   stub(kCPU, tensor);
+
+
+
    +
  4. Write the kernel following the guide. It contains: declaring the function type, registering the stub, calling the stub, etc.
+
+

Note:

+
    +
  1. Some kernels only call the oneDNN or iDeep implementation, or another backend implementation, so no kernel implementation needs to be added for them. (Refer to BatchNorm.cpp)

  2. The Vec related header files must be included in kernel implementation files, but cannot be included in the kernel stub. The kernel stub is common code for all ISA levels, and cannot be compiled with ISA related compiler parameters.

  3. For more intrinsics, check the Intel® Intrinsics Guide.
+
+
+

ISA intrinsics specific kernel example:

+

This is an FP32 to BF16 conversion function example; it is implemented for the AVX512_BF16, AVX512 and DEFAULT ISA levels.

+
//csrc/cpu/aten/CvtFp32ToBf16.h
+
+#pragma once
+
+#include <dyndisp/DispatchStub.h>
+
+namespace torch_ipex {
+namespace cpu {
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len);
+
+namespace {
+
+void cvt_fp32_to_bf16_kernel_impl(at::BFloat16* dst, const float* src, int len);
+
+}
+
+using cvt_fp32_to_bf16_kernel_fn = void (*)(at::BFloat16*, const float*, int);
+IPEX_DECLARE_DISPATCH(cvt_fp32_to_bf16_kernel_fn, cvt_fp32_to_bf16_kernel_stub);
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/CvtFp32ToBf16.cpp
+
+#include "CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(cvt_fp32_to_bf16_kernel_stub);
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
+  return cvt_fp32_to_bf16_kernel_stub(kCPU, dst, src, len);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+

The macros CPU_CAPABILITY_AVX512 and CPU_CAPABILITY_AVX512_BF16 are defined by compiler checks; they mean that the current compiler is capable of generating code for the defined ISA level.

+

Because AVX512_BF16 is a higher level than AVX512 and is compatible with it, the CPU_CAPABILITY_AVX512_BF16 code can be contained within the CPU_CAPABILITY_AVX512 region.

+
//csrc/cpu/aten/kernels/CvtFp32ToBf16Krnl.cpp
+
+#include <ATen/cpu/vec/vec.h>
+#include "csrc/aten/cpu/CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <ATen/cpu/vec/vec512/vec512.h>
+#else
+#include <ATen/cpu/vec/vec256/vec256.h>
+#endif
+using namespace at::vec;
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <immintrin.h>
+
+inline __m256i _cvt_fp32_to_bf16(const __m512 src) {
+#if (defined CPU_CAPABILITY_AVX512_BF16) // AVX512_BF16 ISA implementation.
+  return reinterpret_cast<__m256i>(_mm512_cvtneps_pbh(src));
+#else  // AVX512 ISA implementation.
+  __m512i value = _mm512_castps_si512(src);
+  __m512i nan = _mm512_set1_epi32(0xffff);
+  auto mask_value = _mm512_cmp_ps_mask(src, src, _CMP_ORD_Q);
+  __m512i ones = _mm512_set1_epi32(0x1);
+  __m512i vec_bias = _mm512_set1_epi32(0x7fff);
+  // uint32_t lsb = (input >> 16) & 1;
+  auto t_value = _mm512_and_si512(_mm512_srli_epi32(value, 16), ones);
+  // uint32_t rounding_bias = 0x7fff + lsb;
+  t_value = _mm512_add_epi32(t_value, vec_bias);
+  // input += rounding_bias;
+  t_value = _mm512_add_epi32(t_value, value);
+  // input = input >> 16;
+  t_value = _mm512_srli_epi32(t_value, 16);
+  // Check NaN before converting back to bf16
+  t_value = _mm512_mask_blend_epi32(mask_value, nan, t_value);
+  return _mm512_cvtusepi32_epi16(t_value);
+#endif
+}
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  int i = 0;
+  for (; i < len - 15; i += 16) {
+    auto f32 = _mm512_loadu_ps(src + i);
+    _mm256_storeu_si256((__m256i*)(dst + i), _cvt_fp32_to_bf16(f32));
+  }
+  if (i < len) {
+    auto mask = (1 << (len - i)) - 1;
+    auto f32 = _mm512_maskz_loadu_ps(mask, src + i);
+    _mm256_mask_storeu_epi16(dst + i, mask, _cvt_fp32_to_bf16(f32));
+  }
+}
+
+#else // DEFAULT ISA implementation.
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  for (int j = 0; j < len; j++) {
+    *(dst + j) = *(src + j);
+  }
+}
+
+#endif
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
+
+

Vec specific kernel example:

+

This example shows how to get the data type size and its Vec size. In different ISAs, Vec has a different register width and thus a different Vec size.

+
//csrc/cpu/aten/GetVecLength.h
+#pragma once
+
+#include <dyndisp/DispatchStub.h>
+
+namespace torch_ipex {
+namespace cpu {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype);
+
+namespace {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype);
+}
+
+using get_cpp_typesize_and_vecsize_kernel_fn =
+    std::tuple<int, int> (*)(at::ScalarType);
+IPEX_DECLARE_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_fn,
+    get_cpp_typesize_and_vecsize_kernel_stub);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/GetVecLength.cpp
+
+#include "GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(get_cpp_typesize_and_vecsize_kernel_stub);
+
+// get cpp typesize and vectorsize by at::ScalarType
+std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype) {
+  return get_cpp_typesize_and_vecsize_kernel_stub(kCPU, dtype);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/kernels/GetVecLengthKrnl.cpp
+
+#include <ATen/cpu/vec/vec.h>
+#include "csrc/cpu/aten/GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype) {
+  switch (dtype) {
+    case at::ScalarType::Double:
+      return std::make_tuple(
+          sizeof(double), at::vec::Vectorized<double>::size());
+    case at::ScalarType::Float:
+      return std::make_tuple(sizeof(float), at::vec::Vectorized<float>::size());
+    case at::ScalarType::ComplexDouble:
+      return std::make_tuple(
+          sizeof(c10::complex<double>),
+          at::vec::Vectorized<c10::complex<double>>::size());
+    case at::ScalarType::ComplexFloat:
+      return std::make_tuple(
+          sizeof(c10::complex<float>),
+          at::vec::Vectorized<c10::complex<float>>::size());
+    case at::ScalarType::BFloat16:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::BFloat16>::t)),
+          at::vec::Vectorized<decltype(c10::impl::ScalarTypeToCPPType<
+                                       at::ScalarType::BFloat16>::t)>::size());
+    case at::ScalarType::Half:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::Half>::t)),
+          at::vec::Vectorized<decltype(c10::impl::ScalarTypeToCPPType<
+                                       at::ScalarType::Half>::t)>::size());
+    default:
+      TORCH_CHECK(
+          false,
+          "Currently only floating and complex ScalarType are supported.");
+  }
+}
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_stub,
+    &get_cpp_typesize_and_vecsize_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
+
+
+

Private Debug APIs

+

Here are three ISA-related private APIs that can help with debugging:

+
    +
  1. Query current ISA level.

  2. +
  3. Query max CPU supported ISA level.

  4. +
  5. Query max binary supported ISA level.

  6. +
+
+

Note:

+
    +
  1. The max CPU supported ISA level only depends on CPU features.

  2. The max binary supported ISA level only depends on the compiler version used for the build.

  3. The current ISA level is the smaller of the max CPU ISA level and the max binary ISA level.
+
+
+

Example:

+
python
+Python 3.9.7 (default, Sep 16 2021, 13:09:58)
+[GCC 7.5.0] :: Anaconda, Inc. on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import intel_extension_for_pytorch._C as core
+>>> core._get_current_isa_level()
+'AMX'
+>>> core._get_highest_cpu_support_isa_level()
+'AMX'
+>>> core._get_highest_binary_support_isa_level()
+'AMX'
+>>> quit()
+
+
+
+
+
+

Select ISA level manually.

+

By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable ATEN_CPU_CAPABILITY (same environment variable as PyTorch). The available values are {avx2, avx512, avx512_vnni, avx512_bf16, amx, avx512_fp16}. The effective ISA level would be the minimal level between ATEN_CPU_CAPABILITY and the maximum level supported by the hardware.

+
+

Example:

+
$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AMX
+$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AVX2
+
+
+
+

Note:

+

core._get_current_isa_level() is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purposes only and is subject to change.

+
+
+
+
+

CPU feature check

+

An additional CPU feature check tool is available in the subfolder tests/cpu/isa:

+
$ cmake .
+-- The C compiler identification is GNU 11.2.1
+-- The CXX compiler identification is GNU 11.2.1
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+-- Configuring done
+-- Generating done
+-- Build files have been written to: tests/cpu/isa
+$ make
+[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o
+[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o
+[100%] Linking CXX executable cpu_features
+[100%] Built target cpu_features
+$ ./cpu_features
+XCR0: 00000000000602e7
+os --> avx: true
+os --> avx2: true
+os --> avx512: true
+os --> amx: true
+mmx:                    true
+sse:                    true
+sse2:                   true
+sse3:                   true
+ssse3:                  true
+sse4_1:                 true
+sse4_2:                 true
+aes_ni:                 true
+sha:                    true
+xsave:                  true
+fma:                    true
+f16c:                   true
+avx:                    true
+avx2:                   true
+avx_vnni:                       true
+avx512_f:                       true
+avx512_cd:                      true
+avx512_pf:                      false
+avx512_er:                      false
+avx512_vl:                      true
+avx512_bw:                      true
+avx512_dq:                      true
+avx512_ifma:                    true
+avx512_vbmi:                    true
+avx512_vpopcntdq:                       true
+avx512_4fmaps:                  false
+avx512_4vnniw:                  false
+avx512_vbmi2:                   true
+avx512_vpclmul:                 true
+avx512_vnni:                    true
+avx512_bitalg:                  true
+avx512_fp16:                    true
+avx512_bf16:                    true
+avx512_vp2intersect:                    true
+amx_bf16:                       true
+amx_tile:                       true
+amx_int8:                       true
+prefetchw:                      true
+prefetchwt1:                    false
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/nhwc.html b/cpu/2.4.0+cpu/tutorials/features/nhwc.html new file mode 100644 index 000000000..7742d5e6b --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/nhwc.html @@ -0,0 +1,391 @@ + + + + + + + Channels Last — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Channels Last

+
+

What is Channels Last

+

Note: In PyTorch, memory format refers to the data representation that describes how a multidimensional array (nD) is stored in linear (1D) memory address space. Memory format has the same semantic meaning as layout in oneDNN. Layout in PyTorch has another semantic: describing whether a tensor is dense or sparse, via the attributes ‘torch.strided’ and ‘torch.sparse_coo’.

+

On CNN models, the canonical order of tensor dimensions is assigned with semantic meaning. For example, the input tensor of a 2D convolution is NCHW by default in PyTorch - <batch_size, channels, height, width>. NHWC is an alternative way of describing the tensor dimensions - <batch_size, height, width, channels>.

+

The following image illustrates NCHW and NHWC when N=1. Note that when N=1, NHWC has the same memory layout as a BMP image file. +fig-1-memory-layout

+

PyTorch refers to NCHW as torch.contiguous_format (the default memory format) and to NHWC as torch.channels_last, which is a new feature as of the 1.5 release.

+

TensorFlow uses NHWC as the default memory format because NHWC has a performance advantage over NCHW. On CPU platforms, we propose to optimize Channels Last memory path for the following reasons:

+
    +
  • Performance - NHWC performance is not as good as the blocked memory format (nChw16c), but it is close, and much better than NCHW.

  • +
  • User Experience - Operator coverage of NHWC is higher than that of the blocked memory format (the to_mkldnn() method), so the user experience is better. To be specific, it is difficult to enable operators that manipulate dims on the blocked format, such as sum(dim=?). You would need to convert the tensor from the blocked memory format back to NHWC using to_dense() before feeding it into sum(). This is naturally supported on the Channels Last memory format.

  • +
  • Upstream - Upstreaming will be easier since the CPU path does not rely on any secret ingredient, and both inference and training will be covered.

  • +
+
+
+

Memory Format Is All That Matters

+

On CNN models, memory format is almost the foundation of any upper-level design. One important fact is that converting memory format can be very expensive. Thus, in case multiple CNN operators are performed in sequence, e.g. Conv2d -> ReLU -> Conv2d, it’s beneficial to convert the memory format once, do the computation, and reorder back at the end.

+

On PyTorch, you can use 3 types of memory formats on CNN models:

+
+

a. NCHW (default)

+
## NB: internally blocked format will still be used.
+##   aka. we do 'reorder' for 'input', 'weight' and 'output',
+##   and believe me this is expensive, roughly 50% perf loss...
+input = torch.randn(1, 10, 32, 32)
+model = torch.nn.Conv2d(10, 20, 1, 1)
+output = model(input)
+
+
+
+
+

b. NHWC (WIP for CPU)

+
input = torch.randn(1, 10, 32, 32)
+model = torch.nn.Conv2d(10, 20, 1, 1)
+## NB: convert to Channels Last memory format.
+##   oneDNN supports NHWC for feature maps (input, output),
+##   but weight still needs to be of blocked format.
+##   Still we can save reorders for feature maps.
+input = input.to(memory_format=torch.channels_last)
+model = model.to(memory_format=torch.channels_last)
+output = model(input)
+
+
+
+
+

c. Blocked (nChw16c)

+
from torch.utils import mkldnn as mkldnn_utils
+input = torch.randn(1, 10, 32, 32)
+model = torch.nn.Conv2d(10, 20, 1, 1)
+## NB: convert to blocked memory format.
+##   Note that 'output' is in blocked memory format,
+##   in case the subsequent operator doesn't support blocked memory format
+##   you need to manually reorder it back to NCHW by output.to_dense()
+##   mkldnn_utils.to_mkldnn(model) is used to prepack the weight, this will save weight reorder time
+##   for inference. For training, it is not needed.
+input = input.to_mkldnn()
+model = mkldnn_utils.to_mkldnn(model)
+output = model(input)
+
+
+

The concepts are easier to explain with the diagram below; the dotted lines indicate a simple memory view, not a hard copy. +fig-2(1)-pt-conv-layout-path-dispatch

+

The conclusion is that the NHWC path saves the feature map reorders compared with the NCHW path, but a weight reorder is still necessary since oneDNN requires weights to be in blocked memory format. From a performance perspective, when batch_size=N the weight reorder cost is minimal compared with the feature map reorder; but when batch_size=1, the weight reorder is usually not negligible. So whether to enable weight prepacking on the Channels Last memory format needs further discussion.

+
+
+
+

PyTorch Strided Layout

+

Before moving on, it is necessary to explain how PyTorch organizes tensors in memory - the layout. Here we only focus on dense tensors, skipping the ‘coo’ layout of sparse tensors.

+

The question can be rephrased as: for a tensor of size <N, C, H, W>, how does PyTorch access the element with index <n, c, h, w> in memory? The answer is the stride:

+
tensor: <N, C, H, W>
+index: <n, c, h, w>
+strides: <CHW, HW, W, 1>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = CHW * n + HW * c + W * h + 1 * w
+
+
+

One merit of introducing stride is that it can express noncontiguous tensors, e.g. a slice of big tensor. For example, the ‘Xs’ in the following image have a stride of <n1+n2, 1>.

+

fig-3-pytorch-strided-layout

+

Keep in mind that PyTorch Tensor does not have an attribute called ‘memory_format’ or something else. The memory format expression completely relies on size and stride. The design principle can be found at reference: RFC: Memory format (aka layout aka NHWC) support. No matter what the tensor’s memory format is, we need a logical canonical order for the dimensions - that is NCHW on PyTorch. Thus, size and stride are ALWAYS described in the order of NCHW. Let’s now look at the Channels Last case of the previous question:

+
tensor: <N, C, H, W>
+index: <n, c, h, w>
+strides: <HWC, 1, WC, C>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = HWC * n + 1 * c + WC * h + C * w
+
+
+

Actually, this pattern applies to ALL other memory formats as long as it is 4-dim, e.g. strides for CHWN would be <1, HWN, WN, N>.

+
+
+

PyTorch Channels Last Memory Format APIs

+
+

a. tensor creation

+
x = torch.empty(N, C, H, W, memory_format=torch.channels_last)
+
+
+
+
+

b. tensor conversion

+
## .contiguous() transforms NHWC noncontiguous to NHWC contiguous.
+## .to() converts an NCHW tensor to an NHWC one; it is out-of-place.
+x = x.contiguous(memory_format=torch.channels_last)
+x = x.to(memory_format=torch.channels_last)
+
+## contiguous check
+x.is_contiguous(memory_format=torch.channels_last)
+
+
+
+
+

c. model conversion

+
## NB: tensor.to() is an outplace operation
+##   model.to() is inplace. It calls _apply() which is inplace.
+model = model.to(memory_format=torch.channels_last)
+input = input.to(memory_format=torch.channels_last)
+
+
+
+
+

d. operator coverage

+

Detailed operator coverage information has been listed at reference Operators-with-Channels-Last-support. In brief, ImageNet training topologies on GPU already have full support on Channels Last memory format, while CPU doesn’t.

+

Some spontaneous questions:

+
    +
  • How can you tell whether a model or operator supports Channels Last? - This requires a manual memory format check: a ‘torch.channels_last’ input and weight shall NOT generate a ‘torch.contiguous_format’ output (see the sketch after this list).

  • +
  • What if the model contains an operator that does not support Channels Last? - No error message will be shown; the NHWC tensor will be handled by the operator as a non-contiguous NCHW tensor, so the result might not be correct depending on the algorithm of that operator.

  • +
+
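A minimal sketch of such a check (assuming a recent PyTorch build where Conv2d propagates Channels Last on CPU):

import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 32, 32).to(memory_format=torch.channels_last)

y = conv(x)
# An operator that supports Channels Last keeps the NHWC format on its output;
# a 'torch.contiguous_format' output here would indicate missing support.
print(y.is_contiguous(memory_format=torch.channels_last))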
+
+
+

Writing Channels Last Kernels

+
+

a. Status on CPU

+
    +
  • No support - Requires to register Channels Last kernel for CPU path, e.g. Conv2d;

  • +
  • Explicit support - Already have Channels Last kernel for CPU path (in ATen native manner), need to compare oneDNN counterpart performance, e.g. BatchNorm;

  • +
  • Implicit support - Supported via meta structures like ‘TensorIterator’, need to compare oneDNN counterpart performance, e.g. ReLU.

  • +
+
+
+

b. Register Channels Last Kernel in ATen Native Manner

+

The general guideline is listed under the reference Writing-memory-format-aware-operators and is not repeated here. You may take one of my recent PRs, optimizing upsample (linear mode) performance on CPU, as an example, which also demonstrates the NHWC performance advantage over NCHW thanks to the ease of vectorization.

+
+
+

c. Register oneDNN Kernel on Channels Last

+

Registering a oneDNN kernel under the Channels Last memory format on CPU is no different from cuDNN: only very few upper-level changes are needed, such as changing ‘contiguous()’ to ‘contiguous(suggested_memory_format)’. The automatic reorder of the oneDNN weight is hidden inside ideep.

+
+
+
+

oneDNN NHWC APIs

+

Compared to NCHW interfaces, 2 parts need to be addressed on NHWC interfaces:

+
+

a. Create NHWC Memory

+

The logical size and stride description in oneDNN is always NCHW, which is identical to PyTorch. Example code:

+
/* create md from memory::format_tag */
+auto src_md = memory::desc(
+        {N, C, H, W}, // logical dims, the order is defined by a primitive
+        memory::data_type::f32, // tensor's data type
+        memory::format_tag::nhwc // memory format, NHWC in this case
+);
+
+/* alternative: create md from strides */
+auto src_md = memory::desc(
+        {N, C, H, W}, // logical dims, the order is defined by a primitive
+        memory::data_type::f32, // tensor's data type
+        {stride_N, stride_C, stride_H, stride_W} // the strides
+);
+
+/* create memory */
+auto src_mem = memory(src_md, src_data_ptr, engine);
+
+
+
+
+

b. Create Convolution Primitive

+
    +
  • NCHW - create memory::desc with format_tag::any for ‘input’, ‘output’ and ‘weight’; query the proposed memory::desc from the convolution primitive;

  • +
  • NHWC - create memory::desc with format_tag::nhwc for ‘input’ and ‘output’, and use format_tag::any for ‘weight’; if we use hwio for ‘weight’, the convolution primitive will be created with a gemm implementation rather than jit avx512.

  • +
+
+
+
+

CPU Channels Last Targets

+
    +
  • User Experience - No special user level code change, only ‘input’ and ‘model’ conversion is required;

  • +
  • Scenarios - cover both training and inference;

  • +
  • Models - ResNet50 and ResNext101, extended targets: torchvision models, detectron2;

  • +
  • Performance Targets - training >0.8x blocked; inference throughput > 0.8x blocked; inference latency? (need further discussion)

  • +
  • Operator Coverage - No less than GPU path;

  • +
  • BFloat16 - This part shall align with BFloat16 integration (need further discussion);

  • +
  • int8 - Need further discussion.

  • +
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/optimizer_fusion.html b/cpu/2.4.0+cpu/tutorials/features/optimizer_fusion.html new file mode 100644 index 000000000..e74d831c8 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/optimizer_fusion.html @@ -0,0 +1,205 @@ + + + + + + + Optimizer Fusion — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Optimizer Fusion

+
+

Introduction

+

As with TorchScript, operation fusion reduces the number of operators that will be executed and reduces overhead time. This methodology is also applied to the ipex optimizer optimizations. At the current stage, we support Lamb/Adagrad/SGD fusion for both FP32 and BF16 (split) parameters.

+

Let’s use adagrad update as an example.

+
    if weight_decay != 0:
+        grad = grad.add(param, alpha=weight_decay)
+    clr = lr / (1 + (step - 1) * lr_decay)
+    state_sum.addcmul_(grad, grad, value=1)
+    std = state_sum.sqrt().add_(eps)
+    param.addcdiv_(grad, std, value=-clr)
+
+
+
+
+

Operation Fusion

+

One problem of the native implementation above is that we need to access the whole storage of “grad”, “parameters”, and “state sum” several times. For example, we need to access the whole storage of “parameters” and “grad” in the first clause. For large topologies, it is possible that the “grad” and “parameters” cannot fit in the onboard CPU cache. When we access the storage of “grad” again while executing the third clause, the processor must read the data out from memory again instead of from the faster onboard CPU cache. This is a memory-bound bottleneck that prevents good performance.

+

Fusion is the methodology to solve this problem. Since the 5 clauses in the pseudo code are all element-wise operations, we can fuse them into a single one, as in the pseudo code below.

+
   adagrad_fused_step(param, grad, state_sum, ...(other args))
+
+
+

In our fused operators, we separate the storage of “grad”, “parameters” and “state sum” into several groups and ensure each group is small enough to fit in the cache. The pseudo code below illustrates our execution process.

+
  grad = (grad_0, grad_1, ..., grad_n)
+  param = (param_0, param_1, ..., param_n)
+  state_sum = (state_sum_0, state_sum_1, ..., state_sum_n)
+  for i in range(n):
+    adagrad_step(grad_i, param_i, state_sum_i, ...(other_args))
+
+
+
+
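A minimal Python sketch of this grouping idea (illustrative only; the real fused kernel is implemented in C++ inside the extension, and the chunk size here is a hypothetical value):

import torch

def fused_adagrad_step(param, grad, state_sum, lr, lr_decay, weight_decay, eps, step, chunk_size=8192):
    # Process the flattened (assumed contiguous) tensors chunk by chunk so each working set fits in cache.
    p, g, s = param.view(-1), grad.view(-1), state_sum.view(-1)
    clr = lr / (1 + (step - 1) * lr_decay)
    for start in range(0, p.numel(), chunk_size):
        end = start + chunk_size
        gc = g[start:end]
        if weight_decay != 0:
            gc = gc.add(p[start:end], alpha=weight_decay)
        s[start:end].addcmul_(gc, gc, value=1)          # update the accumulated squared gradient in place
        std = s[start:end].sqrt().add_(eps)
        p[start:end].addcdiv_(gc, std, value=-clr)      # apply the adagrad update to this chunk of parameters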
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/runtime_extension.html b/cpu/2.4.0+cpu/tutorials/features/runtime_extension.html new file mode 100644 index 000000000..b3db992b7 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/runtime_extension.html @@ -0,0 +1,368 @@ + + + + + + + Runtime Extension — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Runtime Extension

+

Intel® Extension for PyTorch* Runtime Extension provides a couple of PyTorch frontend APIs for users to get finer-grained control of the thread runtime. It provides:

+
    +
  1. Multi-stream inference via the Python frontend module ipex.cpu.runtime.MultiStreamModule.

  2. +
  3. Spawn asynchronous tasks via the Python frontend module ipex.cpu.runtime.Task.

  4. +
  5. Program core bindings for OpenMP threads via the Python frontend ipex.cpu.runtime.pin.

  6. +
+

Note: Intel® Extension for PyTorch* Runtime Extension is in the prototype stage. The API is subject to change. More detailed descriptions are available on the API Documentation page.

+
+

Requirements

+

Intel® Extension for PyTorch* Runtime Extension relies on the Intel OpenMP runtime (libiomp) to bind threads to cores. If you want to use it in your application, start the model script with an extra flag: LD_PRELOAD=$LD_PRELOAD:$PATH/libiomp5.so python model_script.py.

+
+
+

Use Cases

+
+

Example of MultiStream Module

+

Runtime extension supports weight-sharing multi-stream inference for throughput mode on CPU. You need to convert the original model into multi-stream model and run the new multi-stream model as normal. The detailed description of parameters to create MultiStreamModule is available at API Documentation page.

+

MultiStreamModule can improve performance for inference in throughput mode. We suggest creating MultiStreamModule with num_streams set to “AUTO”, which heuristically decides the number of streams. Usually it provides reasonable performance; however, it may not be optimal for some cases (refer to the Performance recipes section for details), in which case manual tuning of the number of streams is needed.

+

MultiStreamModule creates the number of streams based on the input parameter num_streams and binds cores to streams based on the input parameter cpu_pool. If the number of cores inside cpu_pool is divisible by num_streams, the cores are allocated equally to each stream. If the number of cores inside cpu_pool is not divisible by num_streams with remainder N, one extra core is allocated to each of the first N streams. We suggest setting num_streams to a divisor of the core number inside cpu_pool.

+

If the inputs’ batch size is larger than and divisible by num_streams, the batch is split equally across the streams. If the batch size is not divisible by num_streams with remainder N, one extra piece is allocated to each of the first N streams. If the inputs’ batch size is less than num_streams, only the first batch-size streams are used, each with a mini batch of one. We suggest setting the inputs’ batch size larger than and divisible by num_streams. When creating MultiStreamModule, if you leave the number of streams as “AUTO”, we suggest setting the inputs’ batch size larger than and divisible by the number of cores. The same even-split rule applies to cores and to batches, as illustrated below.

+
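A minimal Python sketch of this even-split rule (illustrative arithmetic only; the actual partitioning is done inside MultiStreamModule):

def split_evenly(total, num_streams):
    # The first (total % num_streams) streams each receive one extra element.
    base, extra = divmod(total, num_streams)
    return [base + 1 if i < extra else base for i in range(num_streams)]

print(split_evenly(14, 4))  # cores per stream for a 14-core pool and 4 streams: [4, 4, 3, 3]
print(split_evenly(16, 4))  # samples per stream for batch size 16 and 4 streams: [4, 4, 4, 4]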

Let’s create some ExampleNets that will be used by further examples:

+
import torch
+import intel_extension_for_pytorch as ipex
+
+class ExampleNet1(torch.nn.Module):
+    def __init__(self):
+        super(ExampleNet1, self).__init__()
+        self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)
+
+    def forward(self, x):
+        x1 = self.conv(x)
+        y = torch.flatten(x1, start_dim=1)
+        return y
+
+class ExampleNet2(torch.nn.Module):
+    def __init__(self):
+        super(ExampleNet2, self).__init__()
+        self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)
+        self.conv2 = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)
+
+    def forward(self, x1, x2):
+        y1 = self.conv(x1)
+        y2 = self.conv2(x2)
+        y = torch.flatten(y1, start_dim=1)
+        return y1, y
+
+model1 = ExampleNet1()
+model1.eval()
+x = torch.rand(16, 64, 3, 3)
+
+with torch.no_grad():
+    traced_model1 = torch.jit.trace(model1, x)
+    traced_model1 = torch.jit.freeze(traced_model1)
+
+model2 = ExampleNet2()
+model2.eval()
+x2 = torch.rand(16, 64, 3, 3)
+
+with torch.no_grad():
+    traced_model2 = torch.jit.trace(model2, (x, x2))
+    traced_model2 = torch.jit.freeze(traced_model2)
+
+
+
+

Examples1: Basic Usage

+

Here is an example of a model with a single tensor input/output. We create a CPUPool with all the cores available on numa node 0 and create a MultiStreamModule with a stream number of 2 to do inference.

+
# Convert the model into multi_Stream_model
+cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
+multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model1, num_streams=2, cpu_pool=cpu_pool)
+
+with torch.no_grad():
+    y = multi_Stream_model(x)
+
+
+
+
+

Examples2: Usage with “AUTO” setting

+

When creating a MultiStreamModule, we have default settings for num_streams (“AUTO”) and cpu_pool (with all the cores available on numa node 0). With num_streams set to “AUTO”, there are limitations when used with the int8 data type, as mentioned in the Performance recipes section below.

+
# Convert the model into multi_Stream_model
+multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model1)
+
+with torch.no_grad():
+    y = multi_Stream_model(x)
+
+
+
+
+

Examples3: Usage for models with structure inputs/outputs

+

For modules such as ExampleNet2 with structured input/output tensors, you need to create MultiStreamModuleHint objects as the input hint and output hint. MultiStreamModuleHint tells MultiStreamModule how to automatically split the input into streams and concatenate the output from each stream.

+
# Convert the model into multi_Stream_model
+cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
+# Create the input hint object
+input_hint = ipex.cpu.runtime.MultiStreamModuleHint(0, 0)
+# Create the output hint object
+# When Python module has multi output tensors, it will be auto pack into a tuple, So we pass a tuple(0, 0) to create the output_hint
+output_hint = ipex.cpu.runtime.MultiStreamModuleHint((0, 0))
+multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(traced_model2,
+                                                        num_streams=2,
+                                                        cpu_pool=cpu_pool,
+                                                        input_split_hint=input_hint,
+                                                        output_concat_hint=output_hint)
+
+with torch.no_grad():
+    y = multi_Stream_model(x, x2)
+
+
+
+
+

Performance recipes

+

There are two motivations to use the MultiStreamModule:

+
    +
  1. Better cache locality: With MultiStreamModule, the activations are limited to the CPU cores allocated to each stream instead of the whole cpu_pool.

  2. +
  3. Reduce the OMP sync overhead: if one CPU core is allocated to one stream, the whole execution only needs to do an OMP sync once after all streams finish execution instead of a sync per layer.

  4. +
+

Thus, MultiStreamModule may benefit performance for inference in throughput mode. However, the end-to-end performance is impacted by these issues:

+
    +
  1. The kernels’ efficiency, which differs under different numbers of OMP threads.

  2. +
  3. The overhead of inputs’ auto split and outputs’ auto concat for each stream.

  4. +
  5. The overhead of pthread wake-up (for asynchronous stream execution) and thread synchronization after stream execution.

  6. +
+

Here are some performance recipes that we recommend for better multi-stream performance.

+
    +
  • When creating MultiStreamModule with a torch.nn.Module as an imperative-path module, each stream inside MultiStreamModule suffers from the GIL when doing inference together. This hurts end-to-end performance. We recommend creating MultiStreamModule with a torch.jit.ScriptModule.

  • +
  • For convolution networks, intel_extension_for_pytorch has a quick path for getting the convolution primitive that mitigates overhead when OMP_NUM_THREADS is the same between the torch.jit.trace and model execution phases. To use this quick path for better performance, we recommend setting the OMP_NUM_THREADS environment variable before launching the model script. The recommended value of OMP_NUM_THREADS should equal the number of threads used by each stream. For example, when creating MultiStreamModule with stream number s1 and a CPUPool with core number c1, each stream will be allocated c1/s1 threads. We recommend setting OMP_NUM_THREADS to this value.

  • +
  • Numactl and the thread management in MultiStreamModule work at different levels. MultiStreamModule sets the thread affinity for each stream, which works at the thread level. However, Python modules outside the streams, such as the dataloader, are out of the view of MultiStreamModule. As a result, we recommend using numactl -C core_ids -m node_id for process-level core and memory resource management. The core resource set by numactl should be the same as, or a superset of, the core resource used to create the CPUPool. Otherwise, the behavior is undefined in the current implementation.

  • +
+
+
+

Known issues

+
    +
  • The Intel® Extension for PyTorch* runtime extension feature with the Int8 data type does not support dynamic shapes well. To avoid performance issues, we recommend doing jit.trace with the same mini batch size used by each stream. For example, when creating MultiStreamModule with stream number s1 and a global input batch size of gb, each stream will run inference with a mini batch size of gb/s1. We should use this mini batch size to do jit.trace. To be aware of the num_streams value, we recommend creating MultiStreamModule with num_streams set explicitly instead of “AUTO”. Due to the same limitation, the behavior of each stream running inference with a different mini batch size with the int8 data type is undefined and not supported.

  • +
+
+
+
+

Example of asynchronous task

+

Here is an example for using asynchronous tasks. With the support of a runtime API, you can run 2 modules simultaneously. Each module runs on the corresponding cpu pool.

+
cpu_pool1 = ipex.cpu.runtime.CPUPool([0, 1, 2, 3])
+cpu_pool2 = ipex.cpu.runtime.CPUPool([4, 5, 6, 7])
+
+task1 = ipex.cpu.runtime.Task(traced_model1, cpu_pool1)
+task2 = ipex.cpu.runtime.Task(traced_model1, cpu_pool2)
+
+y1_future = task1(x)
+y2_future = task2(x)
+
+y1 = y1_future.get()
+y2 = y2_future.get()
+
+
+
+
+

Example of configuring core binding

+

Runtime Extension provides the ipex.cpu.runtime.pin API to bind a CPU pool of physical cores. We can use it without the async task feature. Here is an example of using ipex.cpu.runtime.pin in a with context.

+
cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
+with ipex.cpu.runtime.pin(cpu_pool):
+    y_runtime = traced_model1(x)
+
+
+
+
+
+

Detail Design

+
+

How the core binding is implemented

+

The Runtime Extension relies on the kmp_* APIs inside the iomp shared library to fulfill the core binding. During the initialization of async threads, kmp_* API functions are invoked internally to start up an OpenMP group with the specified number of worker threads. Each worker thread is then bound to the designated physical core(s) inside this OpenMP group. After initialization, when you submit a task, the OpenMP group serves the requested task.

+
+
+

Design of Task

+

Task is an abstraction of computation based on a PyTorch module that is scheduled asynchronously. When a task is created with a specific nn.Module or jit module, a sub-thread is initialized and bound to this task. During the initialization, an OpenMP worker group is created and bound to this sub-thread. After initialization, the sub-thread waits for input. When the main thread submits an input to this task, the sub-thread wakes up and executes the input. The main thread returns a FutureTensor and is not blocked until an explicit FutureTensor.get() is invoked to get the results executed in the sub-thread.

+
+
+

IOMP preload or load during the runtime

+

Since Runtime Extension relies on the APIs from IOMP, we need to preload IOMP before executing the application. We want Intel® Extension for PyTorch* built with the Runtime API enabled to work fine without loading IOMP if the user doesn’t use the runtime API. Therefore, we choose to dlopen the IOMP library during runtime and ensure the IOMP symbols are initialized once globally.

+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/split_sgd.html b/cpu/2.4.0+cpu/tutorials/features/split_sgd.html new file mode 100644 index 000000000..cada6b407 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/split_sgd.html @@ -0,0 +1,239 @@ + + + + + + + Split SGD — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Split SGD

+

Both optimizations for inference workloads and training workloads are within Intel’s optimization scope. Optimizations of the training optimizer functions are an important part of this. These optimizations use a mechanism called Split SGD and take advantage of the BFloat16 data type and operator fusion. The adagrad, lamb, and sgd optimizers are supported.

+
+

BFloat16

+

The figure below shows the definitions of the Float32 (top) and BFloat16 (bottom) data types. Compared to Float32, BFloat16 is only half as long, and thus saves half the memory. It is supported natively at the instruction-set level to boost deep learning workloads starting from the 3rd Generation of Intel® Xeon® Scalable Processors. It is compatible with Float32 since both have the same bit length for the “sign” and “exponent” parts. BFloat16 only has a 7-bit “mantissa” part while Float32 has 23 bits. BFloat16 has the same capacity to represent “digit ranges” as Float32, but a shorter “precision” part.

+Data types +

An advantage of BFloat16 is that it saves memory and reduces computation workload, but the fewer mantissa bits bring negative effects as well. Let’s use an “ADD” operation as an example to explain the disadvantage. To perform addition of 2 floating point numbers, we need to shift the mantissa part of the numbers left or right to align their exponent parts. Since BFloat16 has a shorter mantissa part, it is much easier than Float32 to lose its mantissa part after the shifting, which causes an accuracy loss issue.

+

Let’s use the following two decimal numbers x and y as an example. We first do the calculation in a high precision data type (10 valid numbers after decimal point).

+
+\[\begin{split}x &= 0.1234500000*10^{10} \\ +y &= 0.1234500000*10^{5} \\ +x+y &= 0.1234500000*10^{10} + 0.1234500000*10^{5} \\ + &= 0.1234500000*10^{10} + 0.0000012345*10^{10} \\ + & =0.1234512345*10^{10}\end{split}\]
+

This makes sense because after shifting y right by 5 digits, the fraction part is still there.

+

Let’s do the calculation using a low precision data type (5 valid numbers after decimal point):

+
+\[\begin{split}x &= 0.12345*10^{10} \\ +y &= 0.12345*10^{5} \\ +x+y &= 0.12345*10^{10} + 0.12345*10^{5} \\ + &= 0.12345*10^{10} + 0.00000*10^{10} \\ + &= 0.12345*10^{10}\end{split}\]
+

Since the data type has only 5 digits for the fraction part, after shifting y by 5 digits, its fraction part is fully removed. This causes significant accuracy loss and, by its nature, is a drawback of lower-precision data types.

+
+
+

Stochastic Gradient Descent (SGD)

+

Basically, training involves 3 steps:

+
    +
  1. Forward propagation: Perform inference once and compare the results with the ground truth to get a loss value.

  2. +
  3. Backward propagation: Utilize chain rule to calculate gradients of parameters based on the loss number.

  4. +
  5. Parameter update: Update the values of the parameters using the gradients calculated during backward propagation.

  6. +
+

Training is actually a loop of these 3 steps in sequence until the loss meets the requirement or a predetermined timeout duration is reached. Stochastic Gradient Descent (SGD) is the most widely used method in the 3rd step to update the parameter values. To make it easy to understand, the 3rd step is described by the following formula:

+
+\[W = W + α * gW\]
+

Where \(W\) denotes parameters to be updated. \(gW\) denotes gradient received during backward propagation and \(α\) denotes learning rate.

+
+
+

Split SGD

+

Since the addition applied in SGD is repeated, because of the low-precision data loss mentioned earlier, if both \(W\) and \(gW\) are stored in the BFloat16 data type, we will most likely lose valid bits and make the training results inaccurate. Using FP32 master parameters is a common practice for avoiding round-off errors at the parameter update step. To keep FP32 master parameters, we have 3 design choices:
1. Only save FP32 parameters: For this choice, we need to introduce an additional FP32->BF16 cast at each iteration to benefit from BF16 at the forward and backward propagation steps.
2. Save both FP32 and BF16 parameters: The BF16 parameters are used at the forward and backward propagation steps, and the FP32 master parameters are used at the update step. This choice introduces a larger memory footprint.
3. “Split” choice: In order to get the performance benefits of BFloat16 at the forward and backward propagation steps, while avoiding an increase in the memory footprint, we propose the “Split SGD” mechanism.

+

The idea is to “split” a 32-bit floating point number into 2 parts:

+
    +
  1. Top half: First 16 bits can be viewed as exactly a BFloat16 number.

  2. +
  3. Bottom half: Last 16 bits are still kept to avoid accuracy loss.

  4. +
+

FP32 parameters are split into “Top half” and “Bottom half”. When performing forward and backward propagations, the Top halves are used to take advantage of Intel BFloat16 support. When performing parameter update with SGD, we concatenate the Top half and the Bottom half to recover the parameters back to FP32 and then perform regular SGD operations.

+

It is a common practice to use FP32 for master parameters in order to avoid round-off errors with BF16 parameter update. SplitSGD is an optimization of storing FP32 master parameters with reduced memory footprint.

+Split SGD +
+

+
+

The following pseudo code illustrates the process of Split SGD.

+
fp32_w = concat_fp32_from_bf16(bf16_w, trail)
+fp32_gw = bf16_gw.float()
+fp32_w += α * fp32_gw   (sgd step without weight_decay, momentum)
+bf16_w, trail = split_bf16_from_fp32(fp32_w)
+
+
+
+
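A minimal, illustrative Python sketch of the split and concat steps (the actual implementation is a fused C++ kernel inside the extension; this version keeps the 16-bit trail in an int32 tensor for simplicity and assumes contiguous tensors):

import torch

def split_bf16_from_fp32(fp32_w):
    bits = fp32_w.view(torch.int32)          # reinterpret the FP32 bits
    trail = bits & 0xFFFF                    # bottom 16 bits
    top_bits = bits - trail                  # original bits with the bottom half zeroed
    bf16_w = top_bits.view(torch.float32).to(torch.bfloat16)  # exact: low mantissa bits are already zero
    return bf16_w, trail

def concat_fp32_from_bf16(bf16_w, trail):
    top_bits = bf16_w.to(torch.float32).view(torch.int32)     # exact widening puts the BF16 bits on top
    return (top_bits + trail).view(torch.float32)             # restore the bottom 16 bits

def split_sgd_step(bf16_w, trail, bf16_gw, lr):
    fp32_w = concat_fp32_from_bf16(bf16_w, trail)
    fp32_w += lr * bf16_gw.float()           # plain SGD step (sign convention as in the pseudo code above)
    return split_bf16_from_fp32(fp32_w)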
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/features/sq_recipe_tuning_api.html b/cpu/2.4.0+cpu/tutorials/features/sq_recipe_tuning_api.html new file mode 100644 index 000000000..f770dda7f --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/features/sq_recipe_tuning_api.html @@ -0,0 +1,231 @@ + + + + + + + Smooth Quant Recipe Tuning API (Prototype) — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Smooth Quant Recipe Tuning API (Prototype)

+

Smooth Quantization is a popular method to improve the accuracy of int8 quantization. +The autotune API allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy.

+

SmoothQuant will introduce alpha to calculate the ratio of input and weight updates to reduce quantization error. SmoothQuant arguments are as below:

Arguments | Default Value | Available Values | Comments
--------- | ------------- | ---------------- | --------
alpha | 'auto' | [0-1] / 'auto' | value to balance input and weight quantization error
init_alpha | 0.5 | [0-1] / 'auto' | value to get baseline quantization error for auto-tuning
alpha_min | 0.0 | [0-1] | min value of auto-tuning alpha search space
alpha_max | 1.0 | [0-1] | max value of auto-tuning alpha search space
alpha_step | 0.1 | [0-1] | step_size of auto-tuning alpha search space
shared_criterion | "mean" | ["min", "mean", "max"] | criterion for input LayerNorm op of a transformer block
enable_blockwise_loss | False | [True, False] | whether to enable block-wise auto-tuning

Please refer to the LLM examples for complete examples.

+

Note: When defining dataloaders for calibration, please follow INC’s dataloader format.

+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/getting_started.html b/cpu/2.4.0+cpu/tutorials/getting_started.html new file mode 100644 index 000000000..202d74d52 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/getting_started.html @@ -0,0 +1,303 @@ + + + + + + + Quick Start — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Quick Start

+

The following instructions assume you have installed the Intel® Extension for PyTorch*. For installation instructions, refer to Installation.

+

To start using the Intel® Extension for PyTorch* in your code, you need to make the following changes:

+
    +
  1. Import the extension with import intel_extension_for_pytorch as ipex.

  2. +
  3. Invoke the optimize() function to apply optimizations.

  4. +
  5. Convert the eager mode model to a graph mode model.

    +
      +
    • For TorchScript, invoke torch.jit.trace() and torch.jit.freeze()

    • +
    • For TorchDynamo, invoke torch.compile(model, backend="ipex") (Beta feature)

    • +
    +
  6. +
+

Important: It is highly recommended to import intel_extension_for_pytorch right after import torch, prior to importing other packages.

+

The example below demonstrates how to use the Intel® Extension for PyTorch* with TorchScript:

+
import torch
+############## import ipex ###############
+import intel_extension_for_pytorch as ipex
+##########################################
+
+model = Model()
+model.eval()
+data = ...
+
+############## TorchScript ###############
+model = ipex.optimize(model, dtype=torch.bfloat16)
+
+with torch.no_grad(), torch.cpu.amp.autocast():
+  model = torch.jit.trace(model, data)
+  model = torch.jit.freeze(model)
+  model(data)
+##########################################
+
+
+

The example below demonstrates how to use the Intel® Extension for PyTorch* with TorchDynamo:

+
import torch
+############## import ipex ###############
+import intel_extension_for_pytorch as ipex
+##########################################
+
+model = Model()
+model.eval()
+data = ...
+
+############## TorchDynamo ###############
+model = ipex.optimize(model, weights_prepack=False)
+
+model = torch.compile(model, backend="ipex")
+with torch.no_grad():
+  model(data)
+##########################################
+
+
+

More examples, including training and usage of low precision data types are available in the Examples section.

+

In Cheat Sheet, you can find more commands that can help you start using the Intel® Extension for PyTorch*.

+
+

LLM Quick Start

+

ipex.llm.optimize is used for Large Language Models (LLM).

+
import torch
+#################### code changes ####################  
+import intel_extension_for_pytorch as ipex
+######################################################  
+import argparse
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+)
+
+# args
+parser = argparse.ArgumentParser("Generation script (fp32/bf16 path)", add_help=False)
+parser.add_argument(
+    "--dtype",
+    type=str,
+    choices=["float32", "bfloat16"],
+    default="float32",
+    help="choose the weight dtype and whether to enable auto mixed precision or not",
+)
+parser.add_argument(
+    "--max-new-tokens", default=32, type=int, help="output max new tokens"
+)
+parser.add_argument(
+    "--prompt", default="What are we having for dinner?", type=str, help="input prompt"
+)
+parser.add_argument("--greedy", action="store_true")
+parser.add_argument("--batch-size", default=1, type=int, help="batch size")
+args = parser.parse_args()
+print(args)
+
+# dtype
+amp_enabled = True if args.dtype != "float32" else False
+amp_dtype = getattr(torch, args.dtype)
+
+# load model
+model_id = MODEL_ID
+config = AutoConfig.from_pretrained(
+    model_id, torchscript=True, trust_remote_code=True
+)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=amp_dtype,
+    config=config,
+    low_cpu_mem_usage=True,
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(
+    model_id,
+    trust_remote_code=True
+)
+model = model.eval()
+model = model.to(memory_format=torch.channels_last)
+
+# Intel(R) Extension for PyTorch*
+#################### code changes ####################  # noqa F401
+model = ipex.llm.optimize(
+    model,
+    dtype=amp_dtype,
+    inplace=True,
+    deployment_mode=True,
+)
+######################################################  # noqa F401
+
+# generate args
+num_beams = 1 if args.greedy else 4
+generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)
+
+# input prompt
+prompt = args.prompt
+input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
+print("---- Prompt size:", input_size)
+prompt = [prompt] * args.batch_size
+
+# inference
+with torch.inference_mode(), torch.cpu.amp.autocast(enabled=amp_enabled):
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    gen_ids = model.generate(
+        input_ids,
+        max_new_tokens=args.max_new_tokens,
+        **generate_kwargs
+    )
+    gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
+    input_tokens_lengths = [x.shape[0] for x in input_ids]
+    output_tokens_lengths = [x.shape[0] for x in gen_ids]
+    total_new_tokens = [
+        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
+    ]
+    print(gen_text, total_new_tokens, flush=True)
+
+
+

More LLM examples, including usage of low precision data types are available in the LLM Examples section.

+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/installation.html b/cpu/2.4.0+cpu/tutorials/installation.html new file mode 100644 index 000000000..1bf267b5a --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/installation.html @@ -0,0 +1,153 @@ + + + + + + + Installation — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Installation

+

Select your preferences and follow the installation instructions provided on the Installation page.

+

After successful installation, refer to the Quick Start and Examples sections to start using the extension in your code.

+

NOTE: For detailed instructions on installing and setting up the environment for Large Language Models (LLM), as well as example scripts, refer to the LLM best practices.

+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/introduction.html b/cpu/2.4.0+cpu/tutorials/introduction.html new file mode 100644 index 000000000..ee1fd00e6 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/introduction.html @@ -0,0 +1,177 @@ + + + + + + + Introduction — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Introduction

+

Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. +Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs.

+
+

Note

+

The package name used when you import Intel® Extension for PyTorch* changed +from intel_pytorch_extension (for versions 1.2.0 through 1.9.0) to +intel_extension_for_pytorch (for versions 1.10.0 and later). Use the +correct package name depending on the version you are using.

+
+

For the detailed list of supported features and usage instructions, refer to Features. For overview of Large Language Models (LLM) optimizations and usage instructions, refer to +the Large Language Models (LLM) section.

+
+

Get Started

+ +
+
+

API Documentation

+

For detailed description of the Intel® Extension for PyTorch* APIs, refer to the API Documentation section.

+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/known_issues.html b/cpu/2.4.0+cpu/tutorials/known_issues.html new file mode 100644 index 000000000..dec99ca2c --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/known_issues.html @@ -0,0 +1,337 @@ + + + + + + + Troubleshooting — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Troubleshooting

+
+

General Usage

+
    +
  • Problem: Issues with the +cpu PyTorch package.

    +
      +
    • Cause: Certain Python packages may have PyTorch as a hard dependency. If you installed the +cpu version of PyTorch, installation of these packages might replace the +cpu version with the default version released on Pypi.org.

    • +
    • Solution: Reinstall the +cpu version back.

    • +
    +
  • +
  • Problem: The workload running with Intel® Extension for PyTorch* occupies a remarkably large amount of memory.

    +
      +
    • Solution: Try to reduce the occupied memory size by setting the weights_prepack parameter of the ipex.optimize() function to False.

    • +
    +
  • +
  • Problem: The conv+bn folding feature of the ipex.optimize() function does not work if inference is done with a custom function:

    +
    import torch
    +import intel_extension_for_pytorch as ipex
    +
    +class Module(torch.nn.Module):
    +    def __init__(self):
    +        super(Module, self).__init__()
    +        self.conv = torch.nn.Conv2d(1, 10, 5, 1)
    +        self.bn = torch.nn.BatchNorm2d(10)
    +        self.relu = torch.nn.ReLU()
    +
    +    def forward(self, x):
    +        x = self.conv(x)
    +        x = self.bn(x)
    +        x = self.relu(x)
    +        return x
    +
    +    def inference(self, x):
    +        return self.forward(x)
    +
    +if __name__ == '__main__':
    +    m = Module()
    +    m.eval()
    +    m = ipex.optimize(m, dtype=torch.float32, level="O0")
    +    d = torch.rand(1, 1, 112, 112)
    +    with torch.no_grad():
    +      m.inference(d)
    +
    +
    +
      +
    • Cause: PyTorch FX limitation.

    • +
    • Solution: You can avoid this error by calling m = ipex.optimize(m, level="O0"), which doesn’t apply ipex optimization, or disable conv+bn folding by calling m = ipex.optimize(m, level="O1", conv_bn_folding=False).

    • +
    +
  • +
+
+
+

Performance Regression

+
    +
  • Some models may experience performance regression compared to 2.0.x due to the deprecation of the NNC feature in PyTorch*.

  • +
+
+
+

TorchDynamo

+
    +
  • Problem: A workload that uses torch.compile() fails to run or demonstrates poor performance.

    +
      +
    • Cause: The support of torch.compile() with ipex as the backend is still a beta feature. Currently, the following HuggingFace models fail to run using torch.compile() with the ipex backend due to memory issues:

      +
        +
      • masked-language-modeling+xlm-roberta-base

      • +
      • causal-language-modeling+gpt2

      • +
      • causal-language-modeling+xlm-roberta-base

      • +
      • summarization+t5-base

      • +
      • text-classification+allenai-longformer-base-409

      • +
      +
    • +
    • Solution: Use the torch.jit APIs and graph optimization APIs of the Intel® Extension for PyTorch*.

    • +
    +
  • +
+
+
+

Dynamic Shape

+
    +
  • Problem: When running NLP model inference with dynamic input data lengths using TorchScript (either torch.jit.trace or torch.jit.script), performance with Intel® Extension for PyTorch* may be lower than without it.

    +
      +
    • Solution: Use the workaround below:

      +
        +
      • Python interface

        +
        torch._C._jit_set_texpr_fuser_enabled(False)
        +
        +
        +
      • +
      • C++ interface

        +
        #include <torch/csrc/jit/passes/tensorexpr_fuser.h>
        +torch::jit::setTensorExprFuserEnabled(false);
        +
        +
        +
      • +
      +
    • +
    +
  • +
+
+
+

INT8

+
    +
  • Problem: Limitations of dynamic shapes support of static quantization:

    +
      +
    • When an input shape is provided in runtime for the first time, execution could take longer time to compile a new kernel for this shape. Specifically, the new kernel compilation time could be long for complicated kernels.

    • +
    • The Channels Last format won’t take effect with dynamic input shapes for CNN models at this time. Optimizations are ongoing.

    • +
    +
  • +
  • Problem: RuntimeError: Overflow when unpacking long when a tensor’s min max value exceeds int range while performing int8 calibration.

    +
      +
    • Solution: Customize QConfig to use min-max calibration method.

    • +
    +
  • +
  • Problem: Models get large accuracy loss with the default quantization recipe.

    + +
  • +
  • Problem: Incorrect results with large tensors when calibrating with quantize_per_tensor, when benchmarking with 1 OpenMP* thread (find more detailed info here).

    +
      +
    • Solution: Editing your code following the pseudocode below can work around this issue, if you do need to explicitly set OMP_NUM_THREADS=1 for benchmarking. However, there could be a performance regression if the oneDNN graph compiler prototype feature is used.

      +

      Workaround pseudocode:

      +
      # perform convert/trace/freeze with omp_num_threads > 1(N)
      +torch.set_num_threads(N)
      +prepared_model = prepare(model, input)
      +converted_model = convert(prepared_model)
      +traced_model = torch.jit.trace(converted_model, input)
      +freezed_model = torch.jit.freeze(traced_model)
      +# run freezed model to apply optimization pass
      +freezed_model(input)
      +
      +# benchmarking with omp_num_threads = 1
      +torch.set_num_threads(1)
      +run_benchmark(freezed_model, input)
      +
      +
      +
    • +
    +
  • +
  • For models with dynamic control flow, please try dynamic quantization. Users are likely to get performance gain for GEMM models.

  • +
  • Support for EmbeddingBag with INT8 when bag size > 1 is work in progress.

  • +
+
+
+

BFloat16

+
    +
  • Problem: BF16 AMP (auto mixed precision) runs abnormally with the extension on AVX2-only machines if the topology contains Conv, Matmul, Linear, and BatchNormalization.

    +
      +
    • Solution: TBD

    • +
    +
  • +
  • Problem: A PyTorch* model containing torch.nn.TransformerEncoderLayer component may encounter a RuntimeError in BF16 training or inference process if the model is optimized by ipex.optimize() with arguments set to default values.

    +
      +
    • Solution: TransformerEncoderLayer optimized by ipex.optimize() with weight prepacking functionality enabled may encounter a weight dimension issue. The error can be avoided by disabling weight prepacking, model = ipex.optimize(model, weights_prepack=False).

    • +
    +
  • +
+
+
+

Runtime Extension

+

The following limitations currently exist:

+
    +
  • Runtime extension of MultiStreamModule does not support DLRM inference, since the input of DLRM (EmbeddingBag specifically) cannot be simply batch split.

  • +
  • Runtime extension of MultiStreamModule has poor RNNT inference performance compared with the native throughput mode. Only part of the RNNT model (joint_net specifically) can be jit traced into a graph. However, in one batch inference, joint_net is invoked multiple times, which increases the overhead of MultiStreamModule for input batch split, thread synchronization and output concat.

  • +
+
+
+

Result Correctness

+
    +
  • Problem: Incorrect Conv and Linear result if the number of OMP threads is changed at runtime.

    +
      +
    • Cause: The oneDNN memory layout depends on the number of OMP threads, which requires the caller to detect changes in the number of OMP threads; this release has not implemented that detection yet.

    • +
    +
  • +
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/license.html b/cpu/2.4.0+cpu/tutorials/license.html new file mode 100644 index 000000000..12055243e --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/license.html @@ -0,0 +1,153 @@ + + + + + + + License — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

License

+

Intel® Extension for PyTorch* is licensed under Apache License Version 2.0. This software includes components that have separate copyright notices and licensing terms. Your use of the source code for these components is subject to the terms and conditions of the following licenses.

+

Apache License Version 2.0:

+

Intel® Extension for PyTorch* LICENSE

+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/llm.html b/cpu/2.4.0+cpu/tutorials/llm.html new file mode 100644 index 000000000..906cb037e --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/llm.html @@ -0,0 +1,866 @@ + + + + + + + Large Language Models (LLM) Optimization Overview — Intel&#174 Extension for PyTorch* 2.4.0+cpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Large Language Models (LLM) Optimization Overview

+

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs are GPT-like architectures that consist of multiple Decoder layers. The MultiHeadAttention and FeedForward layers are two key components of every Decoder layer. The generation task is memory bound because iterative decoding and the kv_cache require special management to reduce memory overhead. Intel® Extension for PyTorch* provides many specific optimizations for these LLMs. On the operator level, the extension provides a highly efficient GEMM kernel to speed up Linear layers and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant and weight-only quantization, are also enabled. Besides, tensor parallelism can also be adopted to get lower latency for LLMs.

+

These LLM-specific optimizations can be automatically applied with a single frontend API function in Python interface, ipex.llm.optimize(). Check llm.optimize for more details.
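As a minimal sketch of that single-call workflow (the model id, prompt, and generation arguments below are placeholders chosen for illustration; see llm.optimize and the examples later in this documentation for the authoritative usage):

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any verified model from the list below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# Single call that applies the LLM-specific optimizations transparently.
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast():
    inputs = tokenizer("What is AI?", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))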


ipex.llm Optimized Model List for Inference


Verified for single instance mode

| MODEL FAMILY | MODEL NAME (Huggingface hub) | FP32 | BF16 | Static Quantization INT8 | Weight-Only Quantization INT8 | Weight-Only Quantization INT4 |
| --- | --- | --- | --- | --- | --- | --- |
| LLAMA | meta-llama/Llama-2-7b-hf | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| LLAMA | meta-llama/Llama-2-13b-hf | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| LLAMA | meta-llama/Llama-2-70b-hf | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| LLAMA | meta-llama/Meta-Llama-3-8B | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| LLAMA | meta-llama/Meta-Llama-3-70B | 🟩 | 🟩 | 🟨 | 🟩 | 🟩 |
| LLAMA | meta-llama/Meta-Llama-3.1-8B-Instruct | 🟩 | 🟩 | 🟨 | 🟩 | 🟩 |
| GPT-J | EleutherAI/gpt-j-6b | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| GPT-NEOX | EleutherAI/gpt-neox-20b | 🟩 | 🟨 | 🟨 | 🟩 | 🟨 |
| DOLLY | databricks/dolly-v2-12b | 🟩 | 🟨 | 🟨 | 🟩 | 🟨 |
| FALCON | tiiuae/falcon-7b | 🟩 | 🟩 | 🟩 | 🟩 |  |
| FALCON | tiiuae/falcon-11b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| FALCON | tiiuae/falcon-40b | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| OPT | facebook/opt-30b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| OPT | facebook/opt-1.3b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| Bloom | bigscience/bloom-1b7 | 🟩 | 🟨 | 🟩 | 🟩 | 🟨 |
| CodeGen | Salesforce/codegen-2B-multi | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| Baichuan | baichuan-inc/Baichuan2-7B-Chat | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| Baichuan | baichuan-inc/Baichuan2-13B-Chat | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Baichuan | baichuan-inc/Baichuan-13B-Chat | 🟩 | 🟨 | 🟩 | 🟩 | 🟨 |
| ChatGLM | THUDM/chatglm3-6b | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| ChatGLM | THUDM/chatglm2-6b | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| GPTBigCode | bigcode/starcoder | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| T5 | google/flan-t5-xl | 🟩 | 🟩 | 🟨 | 🟩 |  |
| MPT | mosaicml/mpt-7b | 🟩 | 🟩 | 🟩 | 🟩 | 🟩 |
| Mistral | mistralai/Mistral-7B-v0.1 | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Mixtral | mistralai/Mixtral-8x7B-v0.1 | 🟩 | 🟩 | 🟩 | 🟨 |  |
| Stablelm | stabilityai/stablelm-2-1_6b | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Qwen | Qwen/Qwen-7B-Chat | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Qwen | Qwen/Qwen2-7B | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| LLaVA | liuhaotian/llava-v1.5-7b | 🟩 | 🟩 | 🟩 | 🟩 |  |
| GIT | microsoft/git-base | 🟩 | 🟩 | 🟩 |  |  |
| Yuan | IEITYuan/Yuan2-102B-hf | 🟩 | 🟩 | 🟨 |  |  |
| Phi | microsoft/phi-2 | 🟩 | 🟩 | 🟩 | 🟩 | 🟨 |
| Phi | microsoft/Phi-3-mini-4k-instruct | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Phi | microsoft/Phi-3-mini-128k-instruct | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Phi | microsoft/Phi-3-medium-4k-instruct | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Phi | microsoft/Phi-3-medium-128k-instruct | 🟩 | 🟩 | 🟨 | 🟩 | 🟨 |
| Whisper | openai/whisper-large-v2 | 🟩 | 🟩 | 🟩 | 🟩 |  |

(For rows with fewer entries than columns in the source, the marks are listed in source order and remaining cells are left blank.)

  • 🟩 signifies that the model can perform well and with good accuracy (<1% difference compared with FP32).
  • 🟨 signifies that the model can perform well while accuracy may not be in a perfect state (>1% difference compared with FP32).

Verified for distributed inference mode via DeepSpeed

| MODEL FAMILY | MODEL NAME (Huggingface hub) | BF16 | Weight-Only Quantization INT8 |
| --- | --- | --- | --- |
| LLAMA | meta-llama/Llama-2-7b-hf | 🟩 | 🟩 |
| LLAMA | meta-llama/Llama-2-13b-hf | 🟩 | 🟩 |
| LLAMA | meta-llama/Llama-2-70b-hf | 🟩 | 🟩 |
| LLAMA | meta-llama/Meta-Llama-3-8B | 🟩 | 🟩 |
| LLAMA | meta-llama/Meta-Llama-3-70B | 🟩 | 🟩 |
| LLAMA | meta-llama/Meta-Llama-3.1-8B-Instruct | 🟩 | 🟩 |
| GPT-J | EleutherAI/gpt-j-6b | 🟩 | 🟩 |
| GPT-NEOX | EleutherAI/gpt-neox-20b | 🟨 | 🟩 |
| DOLLY | databricks/dolly-v2-12b | 🟨 | 🟩 |
| FALCON | tiiuae/falcon-11b | 🟩 | 🟩 |
| FALCON | tiiuae/falcon-40b | 🟩 | 🟩 |
| OPT | facebook/opt-30b | 🟨 | 🟩 |
| OPT | facebook/opt-1.3b | 🟩 | 🟩 |
| Bloom | bigscience/bloom-1b7 | 🟨 | 🟩 |
| CodeGen | Salesforce/codegen-2B-multi | 🟩 | 🟩 |
| Baichuan | baichuan-inc/Baichuan2-7B-Chat | 🟩 | 🟩 |
| Baichuan | baichuan-inc/Baichuan2-13B-Chat | 🟩 | 🟩 |
| Baichuan | baichuan-inc/Baichuan-13B-Chat | 🟨 | 🟩 |
| GPTBigCode | bigcode/starcoder | 🟩 | 🟩 |
| T5 | google/flan-t5-xl | 🟩 | 🟩 |
| Mistral | mistralai/Mistral-7B-v0.1 | 🟩 | 🟩 |
| Mistral | mistralai/Mixtral-8x7B-v0.1 | 🟩 | 🟩 |
| MPT | mosaicml/mpt-7b | 🟩 | 🟩 |
| Stablelm | stabilityai/stablelm-2-1_6b | 🟩 | 🟩 |
| Qwen | Qwen/Qwen-7B-Chat | 🟩 | 🟩 |
| Qwen | Qwen/Qwen2-7B | 🟩 | 🟩 |
| GIT | microsoft/git-base | 🟩 | 🟩 |
| Phi | microsoft/phi-2 | 🟩 | 🟩 |
| Phi | microsoft/Phi-3-mini-4k-instruct | 🟩 | 🟩 |
| Phi | microsoft/Phi-3-mini-128k-instruct | 🟩 | 🟩 |
| Phi | microsoft/Phi-3-medium-4k-instruct | 🟩 | 🟩 |
| Phi | microsoft/Phi-3-medium-128k-instruct | 🟩 | 🟩 |
| Whisper | openai/whisper-large-v2 | 🟩 | 🟩 |

  • 🟩 signifies that the model can perform well and with good accuracy (<1% difference compared with FP32).
  • 🟨 signifies that the model can perform well while accuracy may not be in a perfect state (>1% difference compared with FP32).

Note: The above verified models (including other models in the same model family, like “codellama/CodeLlama-7b-hf” from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp32/bf16). Work is in progress to better support the models in the tables with various data types. In addition, more models will be optimized in the future.

Please check the LLM best known practice for instructions to install/set up the environment and example scripts.

Module Level Optimization API for customized LLM (Prototype)

In the past year, LLMs have been flourishing with many open-sourced models contributed to the community, while researchers are building their own LLMs from transformer blocks with variants in implementation details. To help LLM researchers and developers improve their productivity, Intel® Extension for PyTorch* provides module-level optimizations for commonly used LLM modules and functionalities, which are operators or certain operator combinations in nature.

Please check the LLM module level optimization practice to better understand how to use module-level APIs to optimize your LLM and achieve better performance.

Demos

Intel® Extension for PyTorch* LLM optimizations can be integrated into a typical LLM Q&A web service.

[Animation: UI with BF16]  [Animation: UI with INT8]

The following figures show demos with the Llama 2 model and the GPT-J model, with single-instance inference and distributed inference with DeepSpeed, using lower-precision data types.

[Animation: Llama 2 with BF16]
[Animation: Llama 2 with INT8 Quantization with SmoothQuant]
[Animation: Weight Only Quantization with INT8 for Llama 2]
[Animation: Weight Only Quantization with INT4 for GPT-J]
[Animation: Distributed Inference with DeepSpeed with BF16 on Llama 2 with AutoTP feature]
[Animation: Distributed Inference with DeepSpeed with Weight Only Quantization INT8 on Llama 2 with AutoTP feature]

Figure Legends:

  1. Llama 2 model with BF16
  2. Llama 2 model with INT8 Quantization with SmoothQuant technique
  3. Llama 2 model with INT8 Weight Only Quantization
  4. GPT-J model with INT4 Weight Only Quantization
  5. Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature on BF16
  6. Llama 2 model Distributed Inference with DeepSpeed with AutoTP feature with Weight Only Quantization INT8

Optimization Methodologies

The section below provides a brief introduction to LLM optimization methodologies:

Linear Operator Optimization

The Linear operator is the most obvious hotspot in LLM inference. Intel® Extension for PyTorch* provides dedicated optimizations to speed up linear GEMM kernels, through oneDNN, customized linear kernels for weight-only quantization, and other specific tuning. All of them use a specific block format to utilize hardware resources in a highly efficient way.


Low Precision Data Types

While Generative AI (GenAI) workloads and models are getting more and more popular, the LLMs used in these workloads have more and more parameters. The increasing size of LLMs enhances workload accuracy; however, it also leads to significantly heavier computation and places higher requirements on the underlying hardware. Given that, quantization becomes an important methodology for inference workloads.

Quantization with shorter data types benefits from its nature to improve memory IO throughput and reduce the amount of computation on CPU. Moreover, shorter data types make it possible to keep more data in CPU cache, thus reducing memory access occurrences. Compared to cache access, memory access is much more time consuming. Specifically, from the computation perspective, the AVX-512 Vector Neural Network Instructions (VNNI) instruction set shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction-level acceleration for INT8 computations.
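As a rough way to confirm whether the CPU you are running on exposes these instruction sets, you can inspect the flags reported by the Linux kernel; this is a generic sketch using /proc/cpuinfo conventions, not an Intel® Extension for PyTorch* API:

def read_cpu_flags(path="/proc/cpuinfo"):
    # Return the set of CPU feature flags reported by the Linux kernel.
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = read_cpu_flags()
print("AVX-512 VNNI available:", "avx512_vnni" in flags)
print("AMX INT8 available    :", "amx_int8" in flags)
print("AMX BF16 available    :", "amx_bf16" in flags)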


In addition to the mixed-precision and native INT8 quantization solutions, e.g., post-training static quantization and dynamic quantization in PyTorch, SmoothQuant and weight-only quantization (both INT8 and INT4 weights are supported) are also enabled in Intel® Extension for PyTorch* to get better accuracy and performance compared with the native solutions.

Intel® Extension for PyTorch* speeds up INT8 computations by leveraging oneDNN and oneDNN Graph as the backend. Intel® Extension for PyTorch* static quantization provides a default recipe to automatically decide which operators to quantize. Its backend oneDNN Graph brings matrix-multiplication-based fusions for commonly seen operator patterns and other common fusions like quantization + data type casting. These fusions help achieve the best computation cache locality and efficiency, and thus reduce INT8 quantization overhead significantly.

Intel® Extension for PyTorch* also delivers INT4 optimizations via 4-bit weight-only quantization (WOQ). As the name indicates, WOQ quantizes only the weights to 4-bit integers to further improve computation efficiency via reduced memory bandwidth usage. This technique reduces text generation latency, especially from the second token. AMX INT8 instructions and fusions are also applied for these performant computations.
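For intuition only, the arithmetic behind symmetric group-wise INT4 weight quantization looks like the sketch below; the actual WOQ kernels pack two 4-bit codes per byte and fuse dequantization into the GEMM, which this toy example does not do:

import torch

w = torch.randn(64)                              # one group of FP32 weights
scale = w.abs().max() / 7.0                      # symmetric INT4 code range is [-8, 7]
q = torch.clamp(torch.round(w / scale), -8, 7)   # 4-bit integer codes
w_dq = q * scale                                 # dequantized weights used in the GEMM
print("max abs quantization error:", (w - w_dq).abs().max().item())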


Indirect Access KV Cache

kv_cache is used to reduce computation for the decoder layer, but it also brings memory overhead. For example, when we use beam search, the kv_cache should be reordered according to the latest beam idx, and the current key/value should also be concatenated with kv_cache in the attention layer to get the entire context for the scaled dot-product. When the sequence is very long, the memory overhead caused by reorder_cache and concat becomes the performance bottleneck. Indirect Access KV_cache (IAKV) is provided to reduce these overheads. Firstly, IAKV pre-allocates buffers (key and value use different buffers) to store all key/value hidden states and beam index information; the data format is shown in the following left figure (beam_width=4 in this case), and the token state of key (value) at every timestep is stored in this pre-allocated buffer. Secondly, we can use the beam index history, shown in the following right figure, to decide which beam should be used at a timestep. This information generates an offset to access the kv_cache buffer, so the reorder_cache and concat overheads are eliminated.
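The toy sketch below (shapes and names are illustrative, not the extension's internal layout) shows the idea: keys are written append-only into a pre-allocated buffer, and the beam index history is turned into gather offsets instead of physically reordering the cache:

import torch

max_steps, beam_width, head_dim = 8, 4, 16
key_buffer = torch.zeros(max_steps, beam_width, head_dim)        # pre-allocated [step, beam, dim]
beam_idx_history = torch.zeros(max_steps, beam_width, dtype=torch.long)

for step in range(3):                                            # simulate 3 decode steps
    key_buffer[step] = torch.randn(beam_width, head_dim)         # append-only write, no concat
    beam_idx_history[step] = torch.randint(0, beam_width, (beam_width,))  # parent beam of each beam

def gather_context_keys(beam, last_step):
    # Walk the beam index trace backwards and gather rows, instead of reorder_cache + concat.
    rows, b = [], beam
    for step in range(last_step, -1, -1):
        rows.append(key_buffer[step, b])
        b = int(beam_idx_history[step, b])
    return torch.stack(rows[::-1])                               # [last_step + 1, head_dim]

context_keys = gather_context_keys(beam=0, last_step=2)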

[Figure: The key/value cache data format | The beam idx trace for every step]

Graph Optimization

Operator fusion is generally used to enable sub-graph fusion to reduce the memory footprint. In addition to linear post-op fusions, e.g., linear + activation function, a lot of customized operators are also provided in Intel® Extension for PyTorch* for further performance improvement, for example, Rotary Position Embedding (ROPE) and Root Mean Square Layer Normalization (RMSNorm).
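For reference, the sketch below is a plain-PyTorch definition of what RMSNorm computes; in the extension this chain of small elementwise operations (and similarly ROPE) is replaced by a single fused customized operator (reference formula only, not the extension's kernel):

import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the root mean square over the hidden dimension, then scale.
    # Unlike LayerNorm there is no mean subtraction and no bias.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

hidden_states = torch.randn(2, 5, 64)   # [batch, sequence, hidden]
gamma = torch.ones(64)
out = rms_norm_reference(hidden_states, gamma)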


Distributed Inference

All of the above optimizations already help you achieve very good performance with a single instance. To further reduce the inference latency and improve throughput, tensor parallelism is also enabled in our solution. You can first use DeepSpeed to auto-shard the model and then apply the above optimizations with the frontend API function provided by Intel® Extension for PyTorch*.

diff --git a/cpu/2.4.0+cpu/tutorials/llm/llm_optimize.html b/cpu/2.4.0+cpu/tutorials/llm/llm_optimize.html new file mode 100644 index 000000000..6a488313e --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/llm/llm_optimize.html @@ -0,0 +1,293 @@
LLM Optimizations Frontend API — Intel® Extension for PyTorch* 2.4.0+cpu documentation

LLM Optimizations Frontend API

The new API function, ipex.llm.optimize, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations both model-wise and content-generation-wise. You just need to invoke the ipex.llm.optimize function instead of the ipex.optimize function to apply all optimizations transparently.

This API currently supports inference workloads of certain models. API documentation is available at the API Docs page, and the supported model list can be found at this page.

For LLM fine-tuning, please check the LLM fine-tuning tutorial.

Pseudocode of Common Usage Scenarios

The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch* APIs to work with LLM models. Complete examples can be found at the Example directory.

FP32/BF16
import torch
import intel_extension_for_pytorch as ipex
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

dtype = torch.float  # or torch.bfloat16
model = ipex.llm.optimize(model, dtype=dtype)

# inference with model.generate()
...

SmoothQuant

Supports INT8.
import torch
from torch.utils.data import DataLoader
#################### code changes ####################  # noqa F401
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare
######################################################  # noqa F401
import transformers

# load model
model = transformers.AutoModelForCausalLM.from_pretrained(...).eval()
#################### code changes ####################  # noqa F401
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()

# stage 1: calibration
# prepare your calibration dataset samples
calib_dataset = DataLoader(your_calibration_dataset)
example_inputs = ...  # get one sample input from calib_dataset
calibration_model = ipex.llm.optimize(
  model.eval(),
  quantization_config=qconfig,
)
prepared_model = prepare(
  calibration_model.eval(), qconfig, example_inputs=example_inputs
)
with torch.no_grad():
  for calib_samples in calib_dataset:
    prepared_model(calib_samples)
prepared_model.save_qconf_summary(qconf_summary=qconfig_summary_file_path)

# stage 2: quantization
model = ipex.llm.optimize(
  model.eval(),
  quantization_config=qconfig,
  qconfig_summary_file=qconfig_summary_file_path,
)
######################################################  # noqa F401

# generation inference loop
with torch.inference_mode():
    model.generate({your generate parameters})

Weight Only Quantization (WOQ)

Supports INT8 and INT4.
import torch
import intel_extension_for_pytorch as ipex
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
  weight_dtype=ipex.quantization.WoqWeightDtype.INT8,  # or INT4/NF4
  lowp_mode=ipex.quantization.WoqLowpMode.NONE,  # or FP16, BF16, INT8
)

checkpoint = None  # optionally load int4 or int8 checkpoint
model = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

# inference with model.generate()
...

Distributed Inference with DeepSpeed

Distributed inference can be performed with DeepSpeed. Based on original Intel® Extension for PyTorch* scripts, the following code changes are required.

Check the LLM distributed inference examples for complete code samples.
import torch
import intel_extension_for_pytorch as ipex
import deepspeed
import transformers

dtype = torch.float  # or torch.bfloat16
deepspeed.init_distributed(deepspeed.accelerator.get_accelerator().communication_backend_name())

world_size = ...  # get int from env var "WORLD_SIZE" or "PMI_SIZE"
with deepspeed.OnDevice(dtype=dtype, device="meta"):
  model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
model = deepspeed.init_inference(
  model,
  mp_size=world_size,
  base_dir=repo_root,
  dtype=dtype,
  checkpoint=checkpoints_json,
  **kwargs,
)
model = model.module

model = ipex.llm.optimize(model, dtype=dtype)

# inference
...
diff --git a/cpu/2.4.0+cpu/tutorials/performance.html b/cpu/2.4.0+cpu/tutorials/performance.html new file mode 100644 index 000000000..79d9ceb1e --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/performance.html @@ -0,0 +1,1059 @@
Performance — Intel® Extension for PyTorch* 2.4.0+cpu documentation

Performance

Overview

This page shows performance boost with Intel® Extension for PyTorch* on several popular topologies.

Performance Data for Intel® AI Data Center Products

Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the Intel® Developer Zone article.

LLM Performance

We benchmarked LLaMA2 7B, 13B, and GPT-J 6B with test input token lengths of 256 and 1024 respectively. The tests were carried out on AWS M7i and M6i instances. CPUs of M6i instances are 3rd Gen Intel® Xeon® Processors, which do not have AMX instructions for BF16 computing acceleration, so we use FP32 precision for benchmarking on M6i instances instead of BF16.


[Chart: LLaMA2 7B Results]

[Chart: LLaMA2 13B Results]

[Chart: GPT-J 6B Results]

The LLM inference performances on M7i and M6i instances are compared based on the above results. M7i, with the 4th Gen Xeon® processors, has a remarkable performance advantage over M6i with the 3rd Gen Xeon® processors.


M7i performance boost ratio over M6i for non-quantized (BF16 or FP32) models:

| Model | Speedup | Throughput |
| --- | --- | --- |
| LLaMA2 7B | 2.47x | 2.62x |
| LLaMA2 13B | 2.57x | 2.62x |
| GPT-J 6B | 2.58x | 2.85x |

M7i performance boost ratio over M6i for INT8 quantized models:

| Model | Speedup | Throughput |
| --- | --- | --- |
| LLaMA2 7B | 1.27x | 1.38x |
| LLaMA2 13B | 1.27x | 1.27x |
| GPT-J 6B | 1.29x | 1.36x |

We can also conclude that with a larger batch size the capacity of the model service can be improved at the cost of longer response latency for the individual sessions. The following table exhibits that for INT8 quantized LLaMA2-7b model on M7i instances, input batch_size=8 would increase the total throughput by 6.47x compared with batch_size=1, whereas P90 token latency gets 1.26x longer.

| Batch size | Decoder latency | Total tokens per sec |
| --- | --- | --- |
| 1 | 39 | 26.32 |
| 8 | 49 | 170.21 |
| Ratio | 1.26x | 6.47x |
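For reference, the ratios in the last row follow directly from the two measurements above: 170.21 / 26.32 ≈ 6.47 for total tokens per second, and 49 / 39 ≈ 1.26 for the P90 decoder latency.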

Note: Measured by Intel on 17th Aug 2023; M7i.16xLarge, M6i.16xLarge instances in US-west-2. OS-Ubuntu 22.04-lts, kernel 6.20.0-1009-aws, SW: PyTorch* 2.1 and Intel® Extension for PyTorch* 2.1/llm_feature_branch.


INT8 with v1.11


Performance Numbers

| Hardware | Workload (1) | Precision | Throughput Inference (2): Batch Size | Throughput Inference (2): Boost Ratio | Realtime Inference (3): Batch Size | Realtime Inference (3): Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
|  | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
|  | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
|  | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.

Note: Performance numbers with stock PyTorch are measured with its most performant configuration.


Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.


Accuracy

| Workload | Metric | FP32 | INT8 | INT8/FP32 |
| --- | --- | --- | --- | --- |
| BERT-base_text_classification | f1 | 0.81 | 0.81 | 99.79% |
| BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
| Distilbert-base | f1 | 86.84 | 86.13 | 99.19% |
| ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
| ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
| SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
| VGG11 | Top1 | 69.04 | 67.96 | 98.44% |
| Shufflenetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93% (1) |

1. ShuffleNet INT8 accuracy is expected to improve w/o performance trade-off via histogram calibration algorithm.

Configuration


Software Version

| Software | Version |
| --- | --- |
| PyTorch | v1.11.0 |
| Intel® Extension for PyTorch* | v1.11.0 |

Hardware Configuration

|  | 3rd Generation Intel® Xeon® Scalable Processors |
| --- | --- |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
| Number of nodes | 1 |
| Number of sockets | 2 |
| Cores/Socket | 40 |
| Threads/Core | 2 |
| uCode | 0xd0002a0 |
| Hyper-Threading | ON |
| TurboBoost | ON |
| BIOS version | 04.12.02 |
| Number of DDR Memory slots | 16 |
| Capacity of DDR memory per slot | 16GB |
| DDR frequency | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB |
| Host OS | CentOS Linux release 8.4.2105 |
| Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 |
| Docker OS | Ubuntu 18.04.5 LTS |
| Spectre-Meltdown Mitigation | Mitigated |

FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance


Performance Numbers

| Hardware | Workload (1) | Precision | Throughput Inference (2): Batch Size | Throughput Inference (2): Boost Ratio | Real-time Inference (3): Batch Size | Real-time Inference (3): Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
|  | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
|  | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 64 |
|  | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 128 |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.

Note: Performance numbers with stock PyTorch are measured with its most performant configuration.


Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.


Configuration


Software Version

| Software | Version |
| --- | --- |
| PyTorch | v1.11.0 |
| Intel® Extension for PyTorch* | v1.11.200 |

FP32 and BFloat16 with v1.10


Performance Numbers

| Hardware | Workload (1) | Precision | Throughput Inference (2): Batch Size | Throughput Inference (2): Boost Ratio | Real-time Inference (3): Batch Size | Real-time Inference (3): Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
|  | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
|  | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts |
|  | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 64 |
|  | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; Recommend to set auto_kernel_selection to ON when seq_len exceeds 128 |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
|  | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.

Note: Performance numbers with stock PyTorch are measured with its most performant configuration.


Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.


Configuration


Software Version

| Software | Version |
| --- | --- |
| PyTorch | v1.10.1 |
| Intel® Extension for PyTorch* | v1.10.100 |

Hardware Configuration

|  | 3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake |
| --- | --- | --- |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
| Number of nodes | 1 | 1 |
| Number of sockets | 2 | 2 |
| Cores/Socket | 40 | 28 |
| Threads/Core | 2 | 2 |
| uCode | 0xd0002a0 | 0x700001c |
| Hyper-Threading | ON | ON |
| TurboBoost | ON | ON |
| BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
| Number of DDR Memory slots | 16 | 12 |
| Capacity of DDR memory per slot | 16GB | 64GB |
| DDR frequency | 3200 | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
| Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
| Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 | 4.15.0-76-generic |
| Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
| Spectre-Meltdown Mitigation | Mitigated | Mitigated |
diff --git a/cpu/2.4.0+cpu/tutorials/performance_tuning/launch_script.html b/cpu/2.4.0+cpu/tutorials/performance_tuning/launch_script.html new file mode 100644 index 000000000..61b4ee030 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/performance_tuning/launch_script.html @@ -0,0 +1,856 @@
Launch Script Usage Guide — Intel® Extension for PyTorch* 2.4.0+cpu documentation

Launch Script Usage Guide

Overview

As introduced in the Performance Tuning Guide, there are several factors that influence performance. Setting configuration options properly contributes to a performance boost. However, there is no unified configuration that is optimal for all topologies. Users need to try different combinations by themselves. A launch script is provided to automate these configuration settings to free users from this complicated work. This guide helps you to learn some common usage examples that cover many optimized configuration cases.

The configurations are mainly around the following perspectives.

  1. OpenMP library: [Intel OpenMP library (default) | GNU OpenMP library]
  2. Memory allocator: [PyTorch default memory allocator | Jemalloc | TCMalloc (default)]
  3. Number of instances: [Single instance (default) | Multiple instances]

Usage of launch script

+

The launch script is provided as a module of intel_extension_for_pytorch. You can take advantage of it with the following command:

+
ipexrun [knobs] <your_pytorch_script> [args]
+
+
+

Available option settings (knobs) are listed below:

| knob | type | default value | help |
| --- | --- | --- | --- |
| -h, --help | - | - | show this help message and exit |
| -m, --module | - | False | Changes each process to interpret the launch script as a python module, executing with the same behavior as 'python -m'. |
| --no-python | - | False | Avoid applying python to execute program. |
| --log-dir | str | '' | The log file directory. Setting it to empty ('') disables logging to files. |
| --log-file-prefix | str | 'run' | log file name prefix |

Launcher Common Arguments:

| knob | type | default value | help |
| --- | --- | --- | --- |
| --ncores-per-instance | int | 0 | Number of cores per instance. It has to be an integer larger than or equal to -1. When set to 0, cores are evenly assigned to each instance. If number of cores cannot be divided by number of instances, residual cores are unused. When set to -1, cores are evenly assigned to each instance as much as possible to fully utilize all cores. When set to a number larger than 0, designated number of cores are assigned to each instance. |
| --nodes-list | str | '' | Specify nodes list for multiple instances to run on, in format of list of single node ids "node_id,node_id,..." or list of node ranges "node_id-node_id,...". By default all nodes will be used. |
| --use-e-cores | - | False | Use Efficient-Cores on the workloads or not. By default, only Performance-Cores are used. |
| --memory-allocator | str | 'auto' | Choose which memory allocator to run the workloads with. Supported choices are ['auto', 'default', 'tcmalloc', 'jemalloc']. |
| --omp-runtime | str | 'auto' | Choose which OpenMP runtime to run the workloads with. Supported choices are ['auto', 'default', 'intel']. |
| --strategy | str | 'scatter' | Tell how cores are distributed over instances when only part of all cores are needed on a machine with multiple NUMA nodes. Supported choices are ['scatter', 'close']. With 'scatter', instances are distributed evenly as much as possible over all available NUMA nodes. While with 'close', instances are assigned to cores in order continuously. |

Multi-instance Arguments:

| knob | type | default value | help |
| --- | --- | --- | --- |
| --ninstances | int | 0 | Number of instances |
| --instance-idx | int | -1 | Inside the multi instance list, execute a specific instance at index. If it is set to -1, run all of them. |
| --use-logical-cores | - | False | Use logical cores on the workloads or not. By default, only physical cores are used. |
| --bind-numa-node | - | False | Bind instances to be executed on cores on a single NUMA node. |
| --multi-task-manager | str | 'auto' | Choose which multi task manager to run the workloads with. Supported choices are ['auto', 'none', 'numactl', 'taskset']. |
| --latency-mode | - | False | Use 4 cores per instance over all physical cores. |
| --throughput-mode | - | False | Run one instance per node with all physical cores. |
| --cores-list | str | '' | Specify cores list for multiple instances to run on, in format of list of single core ids "core_id,core_id,..." or list of core ranges "core_id-core_id,...". By default all cores will be used. |
| --benchmark | - | False | Enable benchmark config. JeMalloc's MALLOC_CONF has been tuned for low latency. Recommend to use this for benchmarking purpose; for other use cases, this MALLOC_CONF may cause Out-of-Memory crash. |

Distributed Training Arguments With oneCCL backend:

| knob | type | default value | help |
| --- | --- | --- | --- |
| --nnodes | int | 0 | Number of machines/devices to use for distributed training |
| --nprocs-per-node | int | 0 | Number of processes run on each machine/device. It is by default the number of available nodes when set to 0. Argument --nodes-list affects this default value. |
| --ccl-worker-count | int | 4 | Number of cores per rank for ccl communication |
| --logical-cores-for-ccl | - | False | Use logical cores for the ccl worker. |
| --master-addr | str | 127.0.0.1 | Address of master node (rank 0), should be either IP address or hostname of node 0. For single node multi-proc training, the --master-addr can simply be 127.0.0.1. |
| --master-port | int | 29500 | Port on master node (rank 0) for communication during distributed training. |
| --hostfile | str | 'hostfile' | Set the hostfile for multi-node multi-proc training. The hostfile includes a node address list containing either IP addresses or hostnames of computation nodes. |
| --extra-mpi-params | str | '' | Extra parameters for mpiexec.hydra except for -np -ppn -hostfile and -genv I_MPI_PIN_DOMAIN |

Codeless Optimization feature related option settings (knobs) are listed below:

| knob | type | default value | help |
| --- | --- | --- | --- |
| --auto-ipex | - | False | Auto enabled the ipex optimization feature |
| --dtype | string | False | data type, can choose from ['float32', 'bfloat16'] |
| --auto-ipex-verbose | - | False | This flag is only used for debug and UT of auto ipex. |
| --disable-ipex-graph-mode | - | False | Enable the Graph Mode for ipex.optimize() function |

Note: --latency-mode and --throughput-mode are exclusive knobs to --ninstances, --ncores-per-instance and --use-logical-cores. I.e., setting either of --latency-mode or --throughput-mode overwrites settings of --ninstances, --ncores-per-instance and --use-logical-cores if they are explicitly set in command line. --latency-mode and --throughput-mode are mutually exclusive.

+

The launch script respects existing environment variables when it is launched, except for LD_PRELOAD. If you have your favorite values for certain environment variables, you can set them before running the launch script. The Intel OpenMP library uses an environment variable KMP_AFFINITY to control its behavior. Different settings result in different performance numbers. By default, if you enable the Intel OpenMP library, the launch script will set KMP_AFFINITY to granularity=fine,compact,1,0. If you want to try other values, you can use the export command on Linux to set KMP_AFFINITY before you run the launch script. In this case, the script will not set the default value but take the existing value of KMP_AFFINITY, and print a message to stdout.

+

Execution via the launch script can dump logs into files under a designated log directory so you can do some investigations afterward. By default, it is disabled to avoid undesired log files. You can enable logging by setting knob --log-dir to be:

+
    +
  • directory to store log files. It can be an absolute path or a relative path.
  • types of log files to generate. One file (<prefix>_timestamp_instances.log) contains the command and information when the script was launched. Another type of file (<prefix>_timestamp_instance_#_core#-core#....log) contains the stdout print of each instance.

  • +
+

For example:

+
run_20210712212258_instances.log
run_20210712212258_instance_0_cores_0-43.log
+
+
+
+
+

Usage Examples

+

Example script resnet50.py will be used in this guide.

+ +

Note: GIF files below illustrate CPU usage ONLY. Do NOT infer performance numbers.

+
+

Single instance for inference

+
+

I. Use all physical cores

+
ipexrun --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown below. 1 main worker thread was launched, which then spawned threads on all physical cores.

+

Single instance all physical cores

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712212258_instance_0_cores_0-43.log
+    └── run_20210712212258_instances.log
+
+
+

The run_20210712212258_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712212258_instances.log
+2021-07-12 21:22:58,764 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 21:22:58,764 - __main__ - INFO - OMP_NUM_THREADS=44
+2021-07-12 21:22:58,764 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 21:22:58,764 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 21:22:58,764 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 21:22:58,764 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 21:22:58,764 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
+2021-07-12 21:22:58,764 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712212258_instance_0_cores_0-43.log
+
+
+
+
+

II. Use all cores including logical cores

+
ipexrun --use-logical-core --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all cores, including logical cores.

+

Single instance logical cores

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712223308_instances.log
+    └── run_20210712223308_instance_0_cores_0-87.log
+
+
+

The run_20210712223308_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712223308_instances.log
+2021-07-12 22:33:08,117 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:33:08,117 - __main__ - INFO - OMP_NUM_THREADS=88
+2021-07-12 22:33:08,117 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:33:08,118 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:33:08,118 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:33:08,118 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:33:08,118 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87'] on different NUMA nodes
+2021-07-12 22:33:08,118 - __main__ - INFO - numactl -C 0-87 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712223308_instance_0_cores_0-87.log
+
+
+
+
+

III. Use physical cores on designated nodes

+
ipexrun --nodes-list 1 --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown as below. 1 main worker thread was launched, then it launched threads on all other cores on the same numa node.

+

Single instance all physical cores

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712214504_instances.log
+    └── run_20210712214504_instance_0_cores_22-43.log
+
+
+

The run_20210712214504_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712214504_instances.log
+2021-07-12 21:45:04,512 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 21:45:04,513 - __main__ - INFO - OMP_NUM_THREADS=22
+2021-07-12 21:45:04,513 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 21:45:04,513 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 21:45:04,513 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 21:45:04,513 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 21:45:04,513 - __main__ - INFO - numactl -C 22-43 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712214504_instance_0_cores_22-43.log
+
+
+
+
+

IV. Use your designated number of cores

+
ipexrun --ninstances 1 --ncores-per-instance 10 --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown below. 1 main worker thread was launched, which then spawned threads on the other 9 physical cores.

+

Single instance designated number of cores

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712220928_instances.log
+    └── run_20210712220928_instance_0_cores_0-9.log
+
+
+

The run_20210712220928_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712220928_instances.log
+2021-07-12 22:09:28,355 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:09:28,355 - __main__ - INFO - OMP_NUM_THREADS=10
+2021-07-12 22:09:28,355 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:09:28,355 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:09:28,356 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:09:28,356 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:09:28,356 - __main__ - INFO - numactl -C 0-9 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712220928_instance_0_cores_0-9.log
+
+
+

You can also specify the cores to be utilized using --cores-list argument. For example, if core id 11-20 are desired instead of the first 10 cores, the launch command would be as below.

+
ipexrun --ncores-per-instance 10 --cores-list "11-20" --log-dir ./logs resnet50.py
+
+
+

Please notice that when specifying --cores-list, a corresponding --ncores-per-instance argument is required for instance number deduction.

+

In this case the log directory should be like

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712221615_instances.log
+    └── run_20210712221615_instance_0_cores_11-20.log
+
+
+

The run_20210712221615_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712221615_instances.log
+2021-07-12 22:16:15,591 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:16:15,591 - __main__ - INFO - OMP_NUM_THREADS=10
+2021-07-12 22:16:15,591 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:16:15,591 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:16:15,591 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:16:15,591 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:16:15,591 - __main__ - INFO - numactl -C 11-20 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221615_instance_0_cores_11-20.log
+
+
+
+
+
+

Multiple instances for inference

+
+

V. Throughput mode

+
ipexrun --throughput-mode --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown below. 2 main worker threads were launched on the 2 NUMA nodes respectively, and they then spawned threads on the other physical cores.

+

Multiple instance throughput mode

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712221150_instances.log
+    ├── run_20210712221150_instance_0_cores_0-21.log
+    └── run_20210712221150_instance_1_cores_22-43.log
+
+
+

The run_20210712221150_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712221150_instances.log
+2021-07-12 22:11:50,233 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:11:50,233 - __main__ - INFO - OMP_NUM_THREADS=22
+2021-07-12 22:11:50,233 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:11:50,233 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:11:50,233 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:11:50,233 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:11:50,233 - __main__ - INFO - numactl -C 0-21 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221150_instance_0_cores_0-21.log
+2021-07-12 22:11:50,236 - __main__ - INFO - numactl -C 22-43 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221150_instance_1_cores_22-43.log
+
+
+
+
+

VI. Latency mode

+
ipexrun --latency-mode --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown as below. 4 cores are used for each instance.

+

Multiple instances latency mode

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712221415_instances.log
+    ├── run_20210712221415_instance_0_cores_0-3.log
+    ├── run_20210712221415_instance_1_cores_4-7.log
+    ├── run_20210712221415_instance_2_cores_8-11.log
+    ├── run_20210712221415_instance_3_cores_12-15.log
+    ├── run_20210712221415_instance_4_cores_16-19.log
+    ├── run_20210712221415_instance_5_cores_20-23.log
+    ├── run_20210712221415_instance_6_cores_24-27.log
+    ├── run_20210712221415_instance_7_cores_28-31.log
+    ├── run_20210712221415_instance_8_cores_32-35.log
+    ├── run_20210712221415_instance_9_cores_36-39.log
+    └── run_20210712221415_instance_10_cores_40-43.log
+
+
+

The run_20210712221415_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712221415_instances.log
+2021-07-12 22:14:15,140 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:14:15,140 - __main__ - INFO - OMP_NUM_THREADS=4
+2021-07-12 22:14:15,140 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:14:15,140 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:14:15,140 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:14:15,140 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:14:15,140 - __main__ - INFO - numactl -C 0-3 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_0_cores_0-3.log
+2021-07-12 22:14:15,143 - __main__ - INFO - numactl -C 4-7 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_1_cores_4-7.log
+2021-07-12 22:14:15,146 - __main__ - INFO - numactl -C 8-11 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_2_cores_8-11.log
+2021-07-12 22:14:15,149 - __main__ - INFO - numactl -C 12-15 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_3_cores_12-15.log
+2021-07-12 22:14:15,151 - __main__ - INFO - numactl -C 16-19 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_4_cores_16-19.log
+2021-07-12 22:14:15,154 - __main__ - WARNING - Numa Aware: cores:['20', '21', '22', '23'] on different NUMA nodes
+2021-07-12 22:14:15,154 - __main__ - INFO - numactl -C 20-23 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_5_cores_20-23.log
+2021-07-12 22:14:15,157 - __main__ - INFO - numactl -C 24-27 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_6_cores_24-27.log
+2021-07-12 22:14:15,159 - __main__ - INFO - numactl -C 28-31 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_7_cores_28-31.log
+2021-07-12 22:14:15,162 - __main__ - INFO - numactl -C 32-35 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_8_cores_32-35.log
+2021-07-12 22:14:15,164 - __main__ - INFO - numactl -C 36-39 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_9_cores_36-39.log
+2021-07-12 22:14:15,167 - __main__ - INFO - numactl -C 40-43 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221415_instance_10_cores_40-43.log
+
+
+
+
+

VII. Your designated number of instances

+
ipexrun --ninstances 4 --log-dir ./logs resnet50.py
+
+
+

CPU usage is shown below. 4 main worker threads were launched, which then spawned threads on all other physical cores.

+

Multiple instances designated number of instances

+

If you check your log directory, you will find directory structure as below.

+
.
+├── resnet50.py
+└── logs
+    ├── run_20210712221305_instances.log
+    ├── run_20210712221305_instance_0_cores_0-10.log
+    ├── run_20210712221305_instance_1_cores_11-21.log
+    ├── run_20210712221305_instance_2_cores_22-32.log
+    └── run_20210712221305_instance_3_cores_33-43.log
+
+
+

The run_20210712221305_instances.log contains information and command that were used for this execution launch.

+
$ cat logs/run_20210712221305_instances.log
+2021-07-12 22:13:05,470 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-12 22:13:05,470 - __main__ - INFO - OMP_NUM_THREADS=11
+2021-07-12 22:13:05,470 - __main__ - INFO - Using Intel OpenMP
+2021-07-12 22:13:05,470 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-12 22:13:05,470 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-12 22:13:05,470 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-12 22:13:05,471 - __main__ - INFO - numactl -C 0-10 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_0_cores_0-10.log
+2021-07-12 22:13:05,473 - __main__ - INFO - numactl -C 11-21 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_1_cores_11-21.log
+2021-07-12 22:13:05,476 - __main__ - INFO - numactl -C 22-32 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_2_cores_22-32.log
+2021-07-12 22:13:05,479 - __main__ - INFO - numactl -C 33-43 -m 1 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210712221305_instance_3_cores_33-43.log
+
+
+
+
+

VIII. Your designated number of instances and instance index

+

Launcher by default runs all ninstances for multi-instance inference/training as shown above. You can specify instance_idx to run only that specific instance among the ninstances.

+
ipexrun --ninstances 4 --instance-idx 0 --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2022-01-06 13:01:51,175 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-01-06 13:01:51,176 - __main__ - INFO - Using Intel OpenMP
+2022-01-06 13:01:51,177 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-01-06 13:01:51,177 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-01-06 13:01:51,177 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2022-01-06 13:01:51,177 - __main__ - INFO - numactl -C 0-10 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20220106130151_instance_0_cores_0-13.log
+
+
+
ipexrun --ninstances 4 --instance-idx 1 --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2022-01-06 13:01:51,175 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-01-06 13:01:51,176 - __main__ - INFO - Using Intel OpenMP
+2022-01-06 13:01:51,177 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-01-06 13:01:51,177 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-01-06 13:01:51,177 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2022-01-06 13:01:51,177 - __main__ - INFO - numactl -C 11-21 -m 0 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20220106130151_instance_0_cores_0-13.log
+
+
+
+
+
+

Usage of Jemalloc/TCMalloc/Default memory allocator

+

Memory allocator sometimes influences performance. If users do not designate a desired memory allocator, the launch script searches for one in the order of TCMalloc > Jemalloc > PyTorch default memory allocator, and takes the first match, as sketched below.

+
+
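For illustration only, the lookup order described above can be approximated with a short Python sketch. This is a hypothetical re-creation, not the launcher's actual implementation; the search directories are taken from the warning message shown in the sample logs.

import glob
import os

# Directories the launcher's warning message mentions when no allocator is found.
SEARCH_DIRS = [
    os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib"),
    os.path.join(os.environ.get("VIRTUAL_ENV", ""), "lib"),
    os.path.expanduser("~/.local/lib"),
    "/usr/local/lib", "/usr/local/lib64", "/usr/lib", "/usr/lib64",
]

def find_lib(name):
    # Return the first matching shared library found in the search directories.
    for d in SEARCH_DIRS:
        hits = glob.glob(os.path.join(d, "lib%s*.so*" % name))
        if hits:
            return hits[0]
    return None

# TCMalloc first, then JeMalloc; otherwise fall back to the default allocator.
preload = find_lib("tcmalloc") or find_lib("jemalloc")
if preload:
    os.environ["LD_PRELOAD"] = preload + ":" + os.environ.get("LD_PRELOAD", "")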

Jemalloc

+

Note: You can set your favorite value to MALLOC_CONF before running the launch script if you do not want to use its default setting.

+
ipexrun --memory-allocator jemalloc --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2021-07-13 15:30:48,235 - __main__ - INFO - Use JeMallocl memory allocator
+2021-07-13 15:30:48,235 - __main__ - INFO - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000
+2021-07-13 15:30:48,235 - __main__ - INFO - OMP_NUM_THREADS=44
+2021-07-13 15:30:48,235 - __main__ - INFO - Using Intel OpenMP
+2021-07-13 15:30:48,235 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-13 15:30:48,235 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-13 15:30:48,235 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so:<VIRTUAL_ENV>/lib/libjemalloc.so
+2021-07-13 15:30:48,236 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
+2021-07-13 15:30:48,236 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153048_instance_0_cores_0-43.log
+
+
+
+
+

TCMalloc

+
ipexrun --memory-allocator tcmalloc --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2021-07-13 15:33:33,654 - __main__ - INFO - Use TCMalloc memory allocator
+2021-07-13 15:33:33,654 - __main__ - INFO - OMP_NUM_THREADS=44
+2021-07-13 15:33:33,654 - __main__ - INFO - Using Intel OpenMP
+2021-07-13 15:33:33,654 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-13 15:33:33,654 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-13 15:33:33,654 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so:<VIRTUAL_ENV>/lib/libtcmalloc.so
+2021-07-13 15:33:33,654 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
+2021-07-13 15:33:33,655 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153333_instance_0_cores_0-43.log
+
+
+
+
+

Default memory allocator

+
ipexrun --memory-allocator default --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2021-07-13 15:36:59,784 - __main__ - INFO - OMP_NUM_THREADS=44
+2021-07-13 15:36:59,784 - __main__ - INFO - Using Intel OpenMP
+2021-07-13 15:36:59,784 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-07-13 15:36:59,784 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-07-13 15:36:59,784 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-07-13 15:36:59,784 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
+2021-07-13 15:36:59,784 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210713153659_instance_0_cores_0-43.log
+
+
+
+
+
+

Usage of OpenMP library

+
+

Intel OpenMP Library

+

Generally, Intel OpenMP library brings better performance. Thus, in the launch script, Intel OpenMP library is used by default, if it is available. Intel OpenMP library takes environment variables like KMP_AFFINITY and KMP_BLOCKTIME to control its behavior. You can set your favorite values to them before running the launch script if you do not want to use the default settings.

+
+
+

GNU OpenMP Library

+

The Intel OpenMP library does not, however, always bring better performance compared to the GNU OpenMP library. In such cases, you can use the knob --omp-runtime default to switch the active OpenMP library to the GNU one. The GNU OpenMP specific environment variables for setting CPU affinity, OMP_SCHEDULE and OMP_PROC_BIND, are set automatically.

+
ipexrun --omp-runtime default --log-dir ./logs resnet50.py
+
+
+

you can confirm usage in log file:

+
2021-07-13 15:25:00,760 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-07-13 15:25:00,761 - __main__ - INFO - OMP_SCHEDULE=STATIC
+2021-07-13 15:25:00,761 - __main__ - INFO - OMP_PROC_BIND=CLOSE
+2021-07-13 15:25:00,761 - __main__ - INFO - OMP_NUM_THREADS=44
+2021-07-13 15:25:00,761 - __main__ - WARNING - Numa Aware: cores:['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43'] on different NUMA nodes
+2021-07-13 15:25:00,761 - __main__ - INFO - numactl -C 0-43 <VIRTUAL_ENV>/bin/python resnet50.py 2>&1 | tee ./logs/run_20210713152500_instance_0_cores_0-43.log
+
+
+
+
+
+

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
\ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/performance_tuning/torchserve.html b/cpu/2.4.0+cpu/tutorials/performance_tuning/torchserve.html new file mode 100644 index 000000000..d54613343 --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/performance_tuning/torchserve.html @@ -0,0 +1,483 @@ +TorchServe with Intel® Extension for PyTorch* — Intel® Extension for PyTorch* 2.4.0+cpu documentation
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

TorchServe with Intel® Extension for PyTorch*

+

TorchServe can be used with Intel® Extension for PyTorch* to give a performance boost on Intel hardware.1 Here we show how to use TorchServe with Intel® Extension for PyTorch*.

+

1. While Intel® Extension for PyTorch* benefits all platforms, platforms with AVX512 benefit the most.

+
+

Contents of this Document

+ +
+
+

Install Intel® Extension for PyTorch*

+

Refer to the documentation here.

+
+
+

Serving model with Intel® Extension for PyTorch*

+

After installation, all that is needed to use TorchServe with Intel® Extension for PyTorch* is to enable it in config.properties.

+
ipex_enable=true
+
+
+

Once Intel® Extension for PyTorch* is enabled, deploying a PyTorch model follows the same procedure shown here. TorchServe with Intel® Extension for PyTorch* can deploy any model and run inference.

+
+
+

TorchServe with Launcher

+

Launcher is a script to automate the process of tuning configuration settings on Intel hardware to boost performance. Tuning configurations such as OMP_NUM_THREADS, thread affinity, and memory allocator can have a dramatic effect on performance. Refer to Performance Tuning Guide and Launch Script Usage Guide for details on performance tuning with launcher.

+

All that is needed to use TorchServe with launcher is to set its configuration in config.properties.

+

Add the following lines in config.properties to use launcher with its default configuration.

+
ipex_enable=true
+cpu_launcher_enable=true
+
+
+

Launcher by default uses numactl if it's installed, to ensure the socket is pinned and thus memory is allocated from the local NUMA node. To use launcher without numactl, add the following lines in config.properties.

+
ipex_enable=true
+cpu_launcher_enable=true
+cpu_launcher_args=--disable_numactl
+
+
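As a quick, hypothetical check (not part of the extension or TorchServe) of whether numactl is actually available on the machine, you can look it up on the PATH from Python:

import shutil

# If numactl is missing, either install it or disable it via cpu_launcher_args=--disable_numactl.
if shutil.which("numactl") is None:
    print("numactl not found on PATH")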
+

Launcher by default uses only non-hyperthreaded cores if hyperthreading is present to avoid core compute resource sharing. To use launcher with all cores, both physical and logical, add the following lines in config.properties.

+
ipex_enable=true
+cpu_launcher_enable=true
+cpu_launcher_args=--use_logical_core
+
+
+

Below is an example of passing multiple args to cpu_launcher_args.

+
ipex_enable=true
+cpu_launcher_enable=true
+cpu_launcher_args=--use_logical_core --disable_numactl
+
+
+

Below are some useful cpu_launcher_args to note. Italic values are default if applicable.

+
  1. Memory Allocator: [ PTMalloc --use_default_allocator | TCMalloc --enable_tcmalloc | JeMalloc --enable_jemalloc ]

    • PyTorch by default uses PTMalloc. TCMalloc/JeMalloc generally gives better performance.

  2. OpenMP library: [ GNU OpenMP --disable_iomp | Intel OpenMP ]

    • PyTorch by default uses GNU OpenMP. Launcher by default uses Intel OpenMP. The Intel OpenMP library generally gives better performance.

  3. Node id: [ --node_id ]

    • Launcher by default uses all NUMA nodes. Use --node_id to limit memory access to the local memories of the Nth NUMA node and avoid remote memory access.

Refer to Launch Script Usage Guide for a full list of tunable configurations of launcher, and refer to Performance Tuning Guide for more details.

+
+

Launcher Core Pinning to Boost Performance of TorchServe Multi Worker Inference

+

When running multi-worker inference with TorchServe (requires torchserve>=0.6.1), launcher pins cores to workers to boost performance. Internally, launcher equally divides the number of cores by the number of workers such that each worker is pinned to its assigned cores. Doing so avoids core overlap among workers, which can significantly boost performance for TorchServe multi-worker inference; a sketch of this split is shown below. For example, assume running 4 workers on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core. Launcher will bind worker 0 to cores 0-13, worker 1 to cores 14-27, worker 2 to cores 28-41, and worker 3 to cores 42-55.

+
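As a rough illustration (a hypothetical helper, not the launcher's implementation), the even split of physical cores across workers can be computed as follows; with 56 physical cores and 4 workers it reproduces the 0-13 / 14-27 / 28-41 / 42-55 assignment above.

# Hypothetical helper illustrating how physical cores are split evenly across workers.
def split_cores(num_physical_cores, num_workers):
    per_worker = num_physical_cores // num_workers
    return [list(range(i * per_worker, (i + 1) * per_worker)) for i in range(num_workers)]

# 56 physical cores, 4 workers -> 0-13, 14-27, 28-41, 42-55
for worker_id, cores in enumerate(split_cores(56, 4)):
    print("worker %d: cores %d-%d" % (worker_id, cores[0], cores[-1]))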

CPU usage is shown below. 4 main worker threads were launched, each launching 14 threads affinitized to the assigned physical cores.

+
+

Scaling workers

+

Additionally, when dynamically scaling the number of workers, cores that were pinned to killed workers by the launcher could be left unutilized. To address this problem, launcher internally restarts the workers to re-distribute cores that were pinned to killed workers to the remaining, alive workers. This is taken care of internally, so users do not have to worry about it.

+

Continuing with the above example with 4 workers, assume killing workers 2 and 3. If cores were not re-distributed after the scale down, cores 28-55 would be left unutilized. Instead, launcher re-distributes cores 28-55 to workers 0 and 1 such that now worker 0 binds to cores 0-27 and worker 1 binds to cores 28-55.2

+

CPU usage is shown below. 4 main worker threads were initially launched. Then after scaling down the number of workers from 4 to 2, 2 main worker threads were launched, each launching 28 threads affinitized to the assigned physical cores. +worker_scaling

+

2. Serving is interrupted for a few seconds while re-distributing cores to scaled workers.

+

Again, all that is needed to use TorchServe with launcher core pinning, for multiple workers as well as for scaling workers, is to set its configuration in config.properties.

+

Add the following lines in config.properties to use launcher with its default configuration.

+
cpu_launcher_enable=true
+
+
+
+
+
+
+

Creating and Exporting INT8 model for Intel® Extension for PyTorch*

+

Intel® Extension for PyTorch* supports both eager and TorchScript modes. In this section, we show how to deploy an INT8 model with Intel® Extension for PyTorch*. Refer to here for more details on Intel® Extension for PyTorch* optimizations for quantization.

+
+

1. Creating a serialized file

+

First create a .pt serialized file using Intel® Extension for PyTorch* INT8 inference. Here we show two examples with BERT and ResNet50.

+
+

BERT

+
import torch
+import intel_extension_for_pytorch as ipex
+from transformers import BertModel
+
+# load the model
+model = BertModel.from_pretrained('bert-base-uncased')
+model = model.eval()
+
+# define dummy input tensor to use for the model's forward call to record operations in the model for tracing
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 384
+dummy_tensor = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+# ipex supports two quantization schemes: static and dynamic
+# default dynamic qconfig
+qconfig = ipex.quantization.default_dynamic_qconfig
+
+# prepare and calibrate
+model = prepare(model, qconfig, example_inputs=dummy_tensor)
+
+# convert and deploy
+model = convert(model)
+
+with torch.no_grad():
+    model = torch.jit.trace(model, dummy_tensor, check_trace=False, strict=False)
+    model = torch.jit.freeze(model)
+
+torch.jit.save(model, 'bert_int8_jit.pt')
+
+
+
+
+

ResNet50

+
import torch
+import intel_extension_for_pytorch as ipex
+import torchvision.models as models
+
+# load the model
+model = models.resnet50(pretrained=True)
+model = model.eval()
+
+# define dummy input tensor to use for the model's forward call to record operations in the model for tracing
+N, C, H, W = 1, 3, 224, 224
+dummy_tensor = torch.randn(N, C, H, W)
+
+from intel_extension_for_pytorch.quantization import prepare, convert
+
+# ipex supports two quantization schemes: static and dynamic
+# default static qconfig
+qconfig = ipex.quantization.default_static_qconfig
+
+# prepare and calibrate
+model = prepare(model, qconfig, example_inputs=dummy_tensor, inplace=False)
+
+n_iter = 100
+for i in range(n_iter):
+    model(dummy_tensor)
+
+# convert and deploy
+model = convert(model)
+
+with torch.no_grad():
+    model = torch.jit.trace(model, dummy_tensor)
+    model = torch.jit.freeze(model)
+
+torch.jit.save(model, 'rn50_int8_jit.pt')
+
+
+
+
+
+

2. Creating a Model Archive

+

Once the serialized file (.pt) is created, it can be used with torch-model-archiver as usual.

+

Use the following command to package rn50_int8_jit.pt into rn50_ipex_int8.mar.

+
torch-model-archiver --model-name rn50_ipex_int8 --version 1.0 --serialized-file rn50_int8_jit.pt --handler image_classifier
+
+
+

Similarly, use the following command in the Huggingface_Transformers directory to package bert_int8_jit.pt into bert_ipex_int8.mar.

+
torch-model-archiver --model-name bert_ipex_int8 --version 1.0 --serialized-file bert_int8_jit.pt --handler ./Transformer_handler_generalized.py --extra-files "./setup_config.json,./Seq_classification_artifacts/index_to_name.json"
+
+
+
+
+

3. Start TorchServe to serve the model

+

Make sure to set ipex_enable=true in config.properties. Use the following command to start TorchServe with Intel® Extension for PyTorch*.

+
torchserve --start --ncs --model-store model_store --ts-config config.properties
+
+
+
+
+

4. Registering and Deploying model

+

Registering and deploying the model follows the same steps shown here.

+
+
+
+

Benchmarking with Launcher

+

Launcher can be used with the TorchServe official benchmark to launch the server and benchmark requests with optimal configuration on Intel hardware.

+

In this section we provide examples of benchmarking with launcher with its default configuration.

+

Add the following lines to config.properties in the benchmark directory to use launcher with its default setting.

+
ipex_enable=true
+cpu_launcher_enable=true
+
+
+

The rest of the benchmarking procedure follows the same steps shown here.

+

model_log.log contains information and command that were used for this execution launch.

+

CPU usage on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core is shown as below: +launcher_default_2sockets

+
$ cat logs/model_log.log
+2021-12-01 21:22:40,096 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-12-01 21:22:40,096 - __main__ - INFO - OMP_NUM_THREADS=56
+2021-12-01 21:22:40,096 - __main__ - INFO - Using Intel OpenMP
+2021-12-01 21:22:40,096 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-12-01 21:22:40,096 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-12-01 21:22:40,096 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+2021-12-01 21:22:40,096 - __main__ - WARNING - Numa Aware: cores:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55] in different NUMA node
+
+
+

CPU usage on a machine with Intel(R) Xeon(R) Platinum 8375C CPU, 1 socket, 2 cores per socket, 2 threads per core is shown as below: +launcher_default_1socket

+
$ cat logs/model_log.log
+2021-12-02 06:15:03,981 - __main__ - WARNING - Both TCMalloc and JeMalloc are not found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/<user>/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance
+2021-12-02 06:15:03,981 - __main__ - INFO - OMP_NUM_THREADS=2
+2021-12-02 06:15:03,982 - __main__ - INFO - Using Intel OpenMP
+2021-12-02 06:15:03,982 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2021-12-02 06:15:03,982 - __main__ - INFO - KMP_BLOCKTIME=1
+2021-12-02 06:15:03,982 - __main__ - INFO - LD_PRELOAD=<VIRTUAL_ENV>/lib/libiomp5.so
+
+
+
+

Benchmarking with Launcher Core Pinning

+

As described previously in TorchServe with Launcher, launcher core pinning boosts performance of multi-worker inference. We'll demonstrate launcher core pinning with the TorchServe benchmark, but keep in mind that launcher core pinning is a generic feature applicable to any TorchServe multi-worker inference use case.

+

For example, assume running 4 workers

+
python benchmark-ab.py --workers 4
+
+
+

on a machine with Intel(R) Xeon(R) Platinum 8180 CPU, 2 sockets, 28 cores per socket, 2 threads per core. Launcher will bind worker 0 to cores 0-13, worker 1 to cores 14-27, worker 2 to cores 28-41, and worker 3 to cores 42-55.

+

All that is needed to use TorchServe with launcher's core pinning is to enable launcher in config.properties.

+

Add the following lines to config.properties in the benchmark directory to use launcher’s core pinning:

+
cpu_launcher_enable=true
+
+
+

CPU usage is shown as below: +launcher_core_pinning

+

4 main worker threads were launched, then each launched num_physical_cores/num_workers (14) threads affinitized to the assigned physical cores.

+

+$ cat logs/model_log.log
+2022-03-24 10:41:32,223 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:41:32,223 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:41:32,223 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:41:32,223 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:41:32,223 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:41:32,223 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:41:32,223 - __main__ - INFO - numactl -C 0-13 -m 0 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9000
+
+2022-03-24 10:49:03,760 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:03,761 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:03,762 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:03,762 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:03,762 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:03,762 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:03,763 - __main__ - INFO - numactl -C 14-27 -m 0 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9001
+
+2022-03-24 10:49:26,274 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:26,274 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:26,274 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:26,274 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:26,274 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:26,274 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:26,274 - __main__ - INFO - numactl -C 28-41 -m 1 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9002
+
+2022-03-24 10:49:42,975 - __main__ - INFO - Use TCMalloc memory allocator
+2022-03-24 10:49:42,975 - __main__ - INFO - OMP_NUM_THREADS=14
+2022-03-24 10:49:42,975 - __main__ - INFO - Using Intel OpenMP
+2022-03-24 10:49:42,975 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
+2022-03-24 10:49:42,975 - __main__ - INFO - KMP_BLOCKTIME=1
+2022-03-24 10:49:42,975 - __main__ - INFO - LD_PRELOAD=/lib/libiomp5.so:/lib/libtcmalloc.so
+2022-03-24 10:49:42,975 - __main__ - INFO - numactl -C 42-55 -m 1 /bin/python -u /lib/python/site-packages/ts/model_service_worker.py --sock-type unix --sock-name /tmp/.ts.sock.9003
+
+
+
+

Performance Boost with Intel® Extension for PyTorch* and Launcher

+

pdt_perf

+

The figure above shows the performance improvement of TorchServe with Intel® Extension for PyTorch* and launcher on ResNet50 and BERT-base-uncased. The TorchServe official apache-bench benchmark on Amazon EC2 m6i.24xlarge was used to collect the results3. Add the following lines in config.properties to reproduce the results. Notice that launcher is configured such that a single instance uses all physical cores on a single socket to avoid cross-socket communication and core overlap.

+
ipex_enable=true
+cpu_launcher_enable=true
+cpu_launcher_args=--node_id 0 --enable_jemalloc
+
+
+

Use the following command to reproduce the results.

+
python benchmark-ab.py --url {modelUrl} --input {inputPath} --concurrency 1
+
+
+

For example, run the following command to reproduce the latency performance of ResNet50 with Intel® Extension for PyTorch* int8 data type and batch size of 1. Refer to Creating and Exporting INT8 model for Intel® Extension for PyTorch* for steps to create the rn50_ipex_int8.mar file for ResNet50 with Intel® Extension for PyTorch* int8 data type.

+
python benchmark-ab.py --url 'file:///model_store/rn50_ipex_int8.mar' --concurrency 1
+
+
+

For example, run the following command to reproduce the latency performance of BERT with Intel® Extension for PyTorch* int8 data type and batch size of 1. Refer to Creating and Exporting INT8 model for Intel® Extension for PyTorch* for steps to create the bert_ipex_int8.mar file for BERT with Intel® Extension for PyTorch* int8 data type.

+
python benchmark-ab.py --url 'file:///model_store/bert_ipex_int8.mar' --input '../examples/Huggingface_Transformers/Seq_classification_artifacts/sample_text_captum_input.txt' --concurrency 1
+
+
+

3. Amazon EC2 m6i.24xlarge was used for benchmarking purposes only. For multi-core instances, Intel® Extension for PyTorch* optimizations automatically scale and leverage full instance resources.

+
+

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
\ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/performance_tuning/tuning_guide.html b/cpu/2.4.0+cpu/tutorials/performance_tuning/tuning_guide.html new file mode 100644 index 000000000..a05e7756d --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/performance_tuning/tuning_guide.html @@ -0,0 +1,386 @@ +Performance Tuning Guide — Intel® Extension for PyTorch* 2.4.0+cpu documentation
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Performance Tuning Guide

+
+

Overview

+

Intel® Extension for PyTorch* is a Python package to extend official PyTorch. It makes the out-of-box user experience of PyTorch CPU better while achieving good performance. To fully utilize the power of Intel® architecture and thus yield high performance, PyTorch, as well as Intel® Extension for PyTorch*, are powered by oneAPI Deep Neural Network Library (oneDNN), an open-source cross-platform performance library of basic building blocks for deep learning applications. It is developed and optimized for Intel Architecture Processors, Intel Processor Graphics, and Xe architecture-based Graphics.

+

Although the default primitives of PyTorch and Intel® Extension for PyTorch* are highly optimized, there are things users can do to improve performance. Most optimized configurations can be automatically set by the launcher script. This article introduces common methods recommended by Intel developers.

+
+
+

Contents of this Document

+ +
+
+

Hardware Configuration

+

This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA).

+
+

Intel CPU Structure

+

There are many families of Intel CPUs. We’ll use Intel® Xeon® processor Scalable family as an example to discuss an Intel CPU and how it works. Understanding this background knowledge is helpful to understand the PyTorch optimization methodologies that Intel engineers recommend.

+

On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets (formerly Purley) platform, each chip provides up to 28 cores. Each core has a non-inclusive last-level cache and a 1MB L2 cache. The CPU features fast 2666 MHz DDR4 memory, six memory channels per CPU, Intel Ultra Path Interconnect (UPI) high speed point-to-point processor interconnect, and more. Figure 1 shows the microarchitecture of the Intel® Xeon® processor Scalable family chips. Each CPU chip consists of a number of cores, along with core-specific caches. 6 channels of DDR4 memory are connected to the chip directly. Meanwhile, chips communicate through the Intel UPI interconnect, which features a transfer speed of up to 10.4 GT/s.

+

Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture

+

Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.

+

Usually, a CPU chip is called a socket. A typical two-socket configuration is illustrated in Figure 2. Two CPU sockets are equipped on one motherboard. Each socket is connected to up to 6 channels of memory, called its local memory from the socket's perspective. Sockets are connected to each other via Intel UPI. It is possible for each socket to access memories attached to other sockets, which is usually called remote memory access. Local memory access is always faster than remote memory access. Meanwhile, cores on one socket share a space of high speed cache memory, which is much faster than communication via Intel UPI. Figure 3 shows an ASUS Z11PA-D8 Intel® Xeon® server motherboard, equipped with two sockets for Intel® Xeon® processor Scalable family CPUs.

+

Typical two-socket configuration

+

Figure 2: Typical two-socket configuration.

+

ASUS Z11PA-D8 Intel® Xeon® server motherboard

+

Figure 3: An ASUS Z11PA-D8 Intel® Xeon® server motherboard. It contains two sockets for Intel® Xeon® processor Scalable family CPUs.

+
+
+

Non-Uniform Memory Access (NUMA)

+

It is a good thing that more and more CPU cores are provided to users in one socket, because this brings more computation resources. However, this also brings memory access contention. Programs can stall because memory is busy. To address this problem, Non-Uniform Memory Access (NUMA) was introduced. Compared to Uniform Memory Access (UMA), in which all memories are connected to all cores equally, NUMA divides memories into multiple groups. A certain number of memories are directly attached to one socket's integrated memory controller to become the local memory of this socket. As described in the previous section, local memory access is much faster than remote memory access.

+

Users can get CPU information with the lscpu command on Linux to learn how many cores and sockets there are on the machine. NUMA information, such as how CPU cores are distributed, can also be retrieved. The following is an example of lscpu execution on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. 2 sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads, i.e. each socket has another 28 logical cores. Thus, there are 112 CPU cores in service. When indexing CPU cores, usually physical cores are indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), and the second 28 cores (28-55) are physical cores on the second NUMA socket (node). Logical cores are indexed afterwards: 56-83 are the 28 logical cores on the first NUMA socket (node), and 84-111 are the 28 logical cores on the second NUMA socket (node). Typically, running Intel® Extension for PyTorch* should avoid using logical cores to get good performance.

+
$ lscpu
+...
+CPU(s):              112
+On-line CPU(s) list: 0-111
+Thread(s) per core:  2
+Core(s) per socket:  28
+Socket(s):           2
+NUMA node(s):        2
+...
+Model name:          Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
+...
+NUMA node0 CPU(s):   0-27,56-83
+NUMA node1 CPU(s):   28-55,84-111
+...
+
+
+
+
+
+

Software Configuration

+

This section introduces software configurations that help to boost performance.

+
+

Channels Last

+

Take advantage of the Channels Last memory format for image processing tasks. Compared to the PyTorch default NCHW (torch.contiguous_format) memory format, NHWC (torch.channels_last) is more friendly to Intel platforms, and thus generally yields better performance. A more detailed introduction can be found at the Channels Last page. You can get sample code with ResNet50 at the Example page; a minimal sketch is also shown below.

+
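The sketch below shows one way to combine channels_last with ipex.optimize for a ResNet50-like model (the model choice and tensor shapes are illustrative only).

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

# Pretrained weights are not needed for a memory-format illustration.
model = models.resnet50(pretrained=False).eval()
model = model.to(memory_format=torch.channels_last)   # convert weights to NHWC
model = ipex.optimize(model)

x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)  # NHWC input
with torch.no_grad():
    y = model(x)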
+
+

Numactl

+

Since NUMA largely influences memory access performance, this functionality should also be implemented on the software side.

+

During the development of Linux kernels, more and more sophisticated implementations/optimizations/strategies have been brought out. Version 2.5 of the Linux kernel already contained basic NUMA support, which was further improved in subsequent kernel releases. Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases. Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages. New sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.[1] The behavior of Linux kernels thus differs according to kernel version. Newer Linux kernels may contain further optimizations of NUMA strategies, and thus have better performance. For some workloads, the NUMA strategy influences performance greatly.

+

Linux provides a tool, numactl, that allows user control of NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores share a high-speed cache within one socket, thus it is a good idea to avoid cross-socket computations. From a memory access perspective, binding memory access locally is much faster than accessing remote memories.

+

The following is an example of numactl usage to run a workload on the Nth socket and limit memory access to its local memories on the Nth socket. A more detailed description of the numactl command can be found on the numactl man page.

+

numactl --cpunodebind N --membind N python <script>

+

Assuming cores 0-3 are on socket 0, the following command binds script execution to cores 0-3, and binds memory access to socket 0 local memories.

+

numactl --membind 0 -C 0-3 python <script>

+

[1] Wikipedia - Non-uniform memory access

+
+
+

OpenMP

+

OpenMP is an implementation of multithreading, a method of parallelizing where a primary thread (a series of instructions executed consecutively) forks a specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.[2] Figure 4 illustrates fork-join model of OpenMP execution.

+

A number of parallel block execution threads are forked from primary thread

+

Figure 4: A number of parallel block execution threads are forked from primary thread.

+

Users can control OpenMP behaviors through environment variables to fit their workloads. Also, besides the GNU OpenMP library (libgomp), Intel provides another OpenMP implementation, libiomp, for users to choose from. Environment variables that control the behavior of OpenMP threads differ between libgomp and libiomp. They are introduced separately in the sections below.

+

GNU OpenMP (libgomp) is the default multi-threading library for both PyTorch and Intel® Extension for PyTorch*.

+
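If you want to confirm which OpenMP runtime and how many intra-op threads PyTorch actually picked up at runtime, a quick introspection (just a check, not a tuning step) is:

import torch

# Prints parallelization details, including the OpenMP runtime and thread counts in use.
print(torch.__config__.parallel_info())
print("intra-op threads:", torch.get_num_threads())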

[2] Wikipedia - OpenMP

+
+

OMP_NUM_THREADS

+

Environment variable OMP_NUM_THREADS sets the number of threads used for parallel regions. By default, it is set to be the number of available physical cores. It can be used along with numactl settings, as in the following example. If cores 0-3 are on socket 0, this example command runs <script> on cores 0-3, with 4 OpenMP threads.

+

This environment variable works on both libgomp and libiomp.

+
export OMP_NUM_THREADS=4
+numactl -C 0-3 --membind 0 python <script>
+
+
+
+
+

OMP_THREAD_LIMIT

+

Environment variable OMP_THREAD_LIMIT specifies the number of threads to use for the whole program. The value of this variable shall be a positive integer. If undefined, the number of threads is not limited.

+

Please make sure OMP_THREAD_LIMIT is set to a number equal to or larger than OMP_NUM_THREADS to avoid backward propagation hanging issues.

+
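As a small, hypothetical sanity check (not part of the extension or the launch script), the relationship between the two variables can be verified from Python before launching a run:

import os

num_threads = os.environ.get("OMP_NUM_THREADS")
thread_limit = os.environ.get("OMP_THREAD_LIMIT")
# OMP_THREAD_LIMIT must not be smaller than OMP_NUM_THREADS, otherwise runs may hang.
if num_threads and thread_limit and int(thread_limit) < int(num_threads):
    raise RuntimeError("OMP_THREAD_LIMIT should be >= OMP_NUM_THREADS")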
+
+

GNU OpenMP

+

Besides OMP_NUM_THREADS, other GNU OpenMP specific environment variables are commonly used to improve performance:

+
    +
  • GOMP_CPU_AFFINITY: Binds threads to specific CPUs. The variable should contain a space-separated or comma-separated list of CPUs.

  • +
  • OMP_PROC_BIND: Specifies whether threads may be moved between processors. Setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions.

  • +
  • OMP_SCHEDULE: Determines how OpenMP threads are scheduled.

  • +
+

Here are recommended settings of these environment variables:

+
export GOMP_CPU_AFFINITY="0-3"
+export OMP_PROC_BIND=CLOSE
+export OMP_SCHEDULE=STATIC
+
+
+
+
+

Intel OpenMP

+

By default, PyTorch uses GNU OpenMP (GNU libgomp) for parallel computation. On Intel platforms, the Intel OpenMP Runtime Library (libiomp) provides OpenMP API specification support. It sometimes brings more performance benefits compared to libgomp. Utilizing the LD_PRELOAD environment variable can switch the OpenMP library to libiomp:

+
export LD_PRELOAD=<path>/libiomp5.so:$LD_PRELOAD
+
+
+

Similar to GNU OpenMP, besides OMP_NUM_THREADS, there are other Intel OpenMP specific environment variables that control the behavior of OpenMP threads:

+
    +
  • KMP_AFFINITY

    +

    KMP_AFFINITY controls how to bind OpenMP threads to physical processing units. Depending on the system (machine) topology, application, and operating system, thread affinity can have a dramatic effect on the application speed.

    +

    A common usage scenario is to bind consecutive threads close together, as is done with KMP_AFFINITY=compact. By doing this, communication overhead, cache line invalidation overhead, and page thrashing are minimized. Now, suppose the application also had a number of parallel regions that did not utilize all of the available OpenMP threads. A thread normally executes faster on a core where it is not competing for resources with another active thread on the same core. It is always good to avoid binding multiple threads to the same core while leaving other cores unused. This can be achieved by the following command. Figure 5 illustrates this strategy.

    +
    export KMP_AFFINITY=granularity=fine,compact,1,0
    +
    +
    +

    KMP_AFFINITY=granularity=fine,compact,1,0

    +

    Figure 5: KMP_AFFINITY=granularity=fine,compact,1,0

    +

    The OpenMP thread n+1 is bound to a thread context as close as possible to OpenMP thread n, but on a different core. Once each core has been assigned one OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but they are assigned on different thread contexts.

    +

    It is also possible to bind OpenMP threads to certain CPU cores with the following command.

    +
    export KMP_AFFINITY=granularity=fine,proclist=[N-M],explicit
    +
    +
    +

    More detailed information about KMP_AFFINITY can be found in the Intel CPP Compiler Developer Guide.

    +
  • +
  • KMP_BLOCKTIME

    +

    KMP_BLOCKTIME sets the time, in milliseconds, that a thread, after completing the execution of a parallel region, should wait before sleeping. The default value is 200ms.

    +

    After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Sleeping allows the threads to be used, until more parallel work becomes available, by non-OpenMP threaded code that may execute between parallel regions, or by other applications. A small KMP_BLOCKTIME value may offer better overall performance if application contains non-OpenMP threaded code that executes between parallel regions. A larger KMP_BLOCKTIME value may be more appropriate if threads are to be reserved solely for use for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications. It is suggested to be set to 0 or 1 for convolutional neural network (CNN) based models.

    +
    export KMP_BLOCKTIME=0 (or 1)
    +
    +
    +
  • +
+
+
+
+

Memory Allocator

+

The memory allocator plays an important role from a performance perspective as well. More efficient memory usage reduces overhead on unnecessary memory allocations or destructions, and thus results in faster execution. From practical experience, for deep learning workloads, Jemalloc or TCMalloc can achieve better performance than the default malloc function by reusing memory as much as possible.

+

It is as simple as adding the path of the Jemalloc/TCMalloc dynamic library to the LD_PRELOAD environment variable to switch the memory allocator to one of them.

+
export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
+
+
+
+

Jemalloc

+

Jemalloc is a general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. A more detailed introduction to performance tuning with Jemalloc can be found in the Jemalloc tuning guide.

+

From a performance perspective, a recommended setting for MALLOC_CONF is oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000. However, in some cases the dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000 settings may cause an Out-of-Memory crash. Try oversize_threshold:1,background_thread:true,metadata_thp:auto instead in this case.

+

Getting Jemalloc is straightforward.

+
conda install -c conda-forge jemalloc
+
+
+
+
+

TCMalloc

+

TCMalloc also features a couple of optimizations to speed up program execution. One of them is holding memory in caches to speed up access to commonly-used objects. Holding such caches even after deallocation also helps avoid costly system calls if such memory is later re-allocated. It is part of gperftools, a collection that includes a high-performance multi-threaded malloc() implementation, plus some pretty nifty performance analysis tools.

+

Getting TCMalloc is also not complicated.

+
conda install -c conda-forge gperftools
+
+
+
+
+
+

Denormal Number

+

Denormal numbers are used to store extremely small numbers that are close to 0. Computations with denormal numbers are remarkably slower than with normalized numbers. To solve the low performance issue caused by denormal numbers, users can use the following PyTorch API function.

+
torch.set_flush_denormal(True)
+
+
+
+
+

OneDNN primitive cache

+

Intel® Extension for PyTorch* uses the oneDNN backend for the most compute-bound PyTorch operators, such as Linear and Convolution.

+

To achieve better performance, the oneDNN backend uses its primitive cache to store the primitives created for different input shapes during the warm-up stage (the default primitive cache size is 1024, i.e., 1024 cached primitives). Therefore, when the total number of primitives created for all the input shapes is within the default threshold, Intel® Extension for PyTorch* can get full computation performance from oneDNN kernels.

+

Different input shapes usually come from dynamic shapes of datasets. Dynamic shapes commonly exist in the MaskRCNN model (object detection), the Transformers Wav2vec2 model (speech recognition), and other speech/text-generation related Transformers models.

+

However, a model may need to cache a large number of different input shapes, which can even exceed the default primitive cache size. In such a case, we recommend tuning the oneDNN primitive cache by setting the ONEDNN_PRIMITIVE_CACHE_CAPACITY environment variable to get better performance (note that this comes at the cost of increased memory usage):

+
export ONEDNN_PRIMITIVE_CACHE_CAPACITY={Tuning size}
+# Note that {Tuning size} has an upper limit of 65536 cached primitives
+
+
+

Take Transformers Wav2vec2 for speech recognition as an example: the dataset "common voice" used for inference has a large number of different shapes for the Convolution operator. In our experiment, the best primitive cache size is 4096, and the model runs at its full speed after being warmed up with inputs of all the shape sizes.

+
+
+

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
\ No newline at end of file diff --git a/cpu/2.4.0+cpu/tutorials/releases.html b/cpu/2.4.0+cpu/tutorials/releases.html new file mode 100644 index 000000000..3af7ed80e --- /dev/null +++ b/cpu/2.4.0+cpu/tutorials/releases.html @@ -0,0 +1,1411 @@ +Releases — Intel® Extension for PyTorch* 2.4.0+cpu documentation
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Releases

+
+

2.4.0

+

We are excited to announce the release of Intel® Extension for PyTorch* 2.4.0+cpu which accompanies PyTorch 2.4. This release mainly brings you support for Llama3.1, basic support for LLM serving frameworks like vLLM/TGI, and a set of optimizations to push better performance for LLM models. This release also extends the list of optimized LLM models to a broader level and includes a set of bug fixes and small optimizations. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and give feedback to help us further improve this product.

+
+

Highlights

+
    +
  • Llama 3.1 support

  • +
+

Meta has newly released Llama 3.1 with new features like longer context length (128K) support. Intel® Extension for PyTorch* has provided support for Llama 3.1 since its launch date with an early release version, and now supports it with this official release.

+
    +
  • Serving framework support

  • +
+

Typical LLM serving frameworks, including vLLM and TGI, can now work with Intel® Extension for PyTorch*, which provides optimized performance for Xeon® Scalable CPUs. Besides the integration of LLM serving frameworks with ipex.llm module level APIs, we also continue optimizing the performance and quality of the underlying Intel® Extension for PyTorch* operators such as paged attention and flash attention. We also provide new support in ipex.llm module level APIs for 4-bit AWQ quantization based on weight-only quantization, and distributed communications with shared memory optimization.

+
    +
  • Large Language Model (LLM) optimization:

  • +
+

Intel® Extension for PyTorch* further optimized the performance of the weight only quantization kernels, enabled more fusion pattern variants for LLMs and extended the optimized models to include whisper, falcon-11b, Qwen2, and definitely Llama 3.1, etc. A full list of optimized models can be found at LLM optimization.

+
    +
  • Bug fixing and other optimization

    +
      +
    • Fixed the quantization with auto-mixed-precision (AMP) mode of Qwen-7b #3030

    • +
    • Fixed the illegal memory access issue in the Flash Attention kernel #2987

    • +
    • Re-structured the paths of LLM example scripts #3080

    • +
    • Upgraded oneDNN to v3.5.2 #3143

    • +
    • Misc fix and enhancement #3079 #3116

    • +
    +
  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.3.0+cpu…v2.4.0+cpu

+
+
+
+

2.3.100

+
+

Highlights

+
    +
  • Added the optimization for Phi-3: #2883

  • +
  • Fixed the state_dict method patched by ipex.optimize to support DistributedDataParallel #2910

  • +
  • Fixed the linking issue in CPPSDK #2911

  • +
  • Fixed the ROPE kernel for cases where the batch size is larger than one #2928

  • +
  • Upgraded deepspeed to v0.14.3 to include the support for Phi-3 #2985

  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.3.0+cpu…v2.3.100+cpu

+
+
+
+

2.3.0

+

We are excited to announce the release of Intel® Extension for PyTorch* 2.3.0+cpu which accompanies PyTorch 2.3. This release mainly brings you the new feature on Large Language Model (LLM) called module level LLM optimization API, which provides module level optimizations for commonly used LLM modules and functionalities, and targets to optimize customized LLM modeling for scenarios like private models, self-customized models, LLM serving frameworks, etc. This release also extends the list of optimized LLM models to a broader level and includes a set of bug fixes and small optimizations. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and give feedback to help us further improve this product.

+
+

Highlights

+
    +
  • Large Language Model (LLM) optimization

    +

    Intel® Extension for PyTorch* provides a new feature called module level LLM optimization API, which provides module level optimizations for commonly used LLM modules and functionalities. LLM creators can then use this new API set to replace related parts in models by themselves, with which to reach peak performance.

    +

    There are 3 categories of module level LLM optimization APIs in general:

    +
      +
    • Linear post-op APIs

    • +
    +
    # using module init and forward
    +ipex.llm.modules.linearMul
    +ipex.llm.modules.linearGelu
    +ipex.llm.modules.linearNewGelu
    +ipex.llm.modules.linearAdd
    +ipex.llm.modules.linearAddAdd
    +ipex.llm.modules.linearSilu
    +ipex.llm.modules.linearSiluMul
    +ipex.llm.modules.linear2SiluMul
    +ipex.llm.modules.linearRelu
    +
    +
    +
      +
    • Attention related APIs

    • +
    +
    # using module init and forward
    +ipex.llm.modules.RotaryEmbedding
    +ipex.llm.modules.RMSNorm
    +ipex.llm.modules.FastLayerNorm
    +ipex.llm.modules.VarlenAttention
    +ipex.llm.modules.PagedAttention
    +ipex.llm.modules.IndirectAccessKVCacheAttention
    +
    +# using as functions
    +ipex.llm.functional.rotary_embedding
    +ipex.llm.functional.rms_norm
    +ipex.llm.functional.fast_layer_norm
    +ipex.llm.functional.indirect_access_kv_cache_attention
    +ipex.llm.functional.varlen_attention
    +
    +
    +
      +
    • Generation related APIs

    • +
    +
    # using for optimizing huggingface generation APIs with prompt sharing
    +ipex.llm.generation.hf_beam_sample
    +ipex.llm.generation.hf_beam_search
    +ipex.llm.generation.hf_greedy_search
    +ipex.llm.generation.hf_sample
    +
    +
    +

    More detailed introduction on how to apply this API set and example code walking you through can be found here.

    +
  • +
  • Bug fixing and other optimization

    + +
  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.2.0+cpu…v2.3.0+cpu

+
+
+
+

2.2.0

+

We are excited to announce the release of Intel® Extension for PyTorch* 2.2.0+cpu which accompanies PyTorch 2.2. This release mainly brings in our latest optimization on Large Language Model (LLM) including a new dedicated API set (ipex.llm), new capability for auto-tuning accuracy recipes for LLM, and a broader list of optimized LLM models, together with a set of bug fixes and small optimizations. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try this release and give feedback to help us further improve this product.

+
+

Highlights

+
    +
  • Large Language Model (LLM) optimization

    +

    Intel® Extension for PyTorch* provides a new dedicated module, ipex.llm, to host for Large Language Models (LLMs) specific APIs. With ipex.llm, Intel® Extension for PyTorch* provides comprehensive LLM optimization cross various popular datatypes including FP32/BF16/INT8/INT4. Specifically for low precision, both SmoothQuant and Weight-Only quantization are supported for various scenarios. And user can also run Intel® Extension for PyTorch* with Tensor Parallel to fit in the multiple ranks or multiple nodes scenarios to get even better performance.

    +

    A typical API under this new module is ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. ipex.llm.optimize is an upgrade API to replace previous ipex.optimize_transformers, which will bring you more consistent LLM experience and performance. Below shows a simple example of ipex.llm.optimize for fp32 or bf16 inference:

    +
    import torch
    +import intel_extension_for_pytorch as ipex
    +import transformers
    +
    +model= transformers.AutoModelForCausalLM(model_name_or_path).eval()
    +
    +dtype = torch.float # or torch.bfloat16
    +model = ipex.llm.optimize(model, dtype=dtype)
    +
    +model.generate(YOUR_GENERATION_PARAMS)
    +
    +
    +

    More examples of this API can be found at LLM optimization API.

    +

    Besides the new optimization API for LLM inference, Intel® Extension for PyTorch* also provides new capability for users to auto-tune a good quantization recipe for running SmoothQuant INT8 with good accuracy. SmoothQuant is a popular method to improve the accuracy of int8 quantization. The new auto-tune API allows automatic global alpha tuning, and automatic layer-by-layer alpha tuning provided by Intel® Neural Compressor for the best INT8 accuracy. More details can be found at SmoothQuant Recipe Tuning API Introduction.

    +

    Intel® Extension for PyTorch* newly optimized many more LLM models including more llama2 variance like llama2-13b/llama2-70b, encoder-decoder model like T5, code generation models like starcoder/codegen, and more like Baichuan, Baichuan2, ChatGLM2, ChatGLM3, mistral, mpt, dolly, etc.. A full list of optimized models can be found at LLM Optimization.

    +
  • +
  • Bug fixing and other optimization

    + +
  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.1.100+cpu…v2.2.0+cpu

+
+
+
+

2.1.100

+
+

Highlights

+ +

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.1.0+cpu…v2.1.100+cpu

+
+
+
+

2.1.0

+
+

Highlights

+
    +
  • Large Language Model (LLM) optimization (Experimental): Intel® Extension for PyTorch* provides a lot of specific optimizations for LLMs in this new release. At the operator level, we provide a highly efficient GEMM kernel to speed up the Linear layer and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant for INT8 and weight-only quantization for INT4 and INT8, are also enabled. Besides, tensor parallel can also be adopted to get lower latency for LLMs.

    +

    A new API function, ipex.optimize_transformers, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. You just need to invoke the ipex.optimize_transformers function instead of the ipex.optimize function to apply all optimizations transparently. More detailed information can be found at Large Language Model optimizations overview.

    +

    Specifically, this new release includes the support of SmoothQuant and weight only quantization (both INT8 weight and INT4 weight) as to provide better performance and accuracy for low precision scenarios.

    +

    A typical usage of this new feature is quite simple as below:

    +
    import torch
    +import intel_extension_for_pytorch as ipex
    +...
    +model = ipex.optimize_transformers(model, dtype=dtype)
    +
    +
    +
  • +
  • torch.compile backend optimization with PyTorch Inductor (Experimental): We optimized Intel® Extension for PyTorch to leverage PyTorch Inductor’s capability when working as a backend of torch.compile, which can better utilize torch.compile’s power of graph capture, Inductor’s scalable fusion capability, and still keep customized optimization from Intel® Extension for PyTorch.

  • +
  • performance optimization of static quantization under dynamic shape: We optimized the static quantization performance of Intel® Extension for PyTorch for dynamic shapes. The usage is the same as the workflow of running static shapes while inputs of variable shapes could be provided during runtime.

  • +
  • Bug fixing and other optimization

    +
      +
    • Optimized the runtime memory usage #1563

    • +
    • Fixed the excessive size of the saved model #1677 #1688

    • +
    • Supported shared parameters in ipex.optimize #1664

    • +
    • Enabled the optimization of LARS fusion #1695

    • +
    • Supported dictionary input in ipex.quantization.prepare #1682

    • +
    • Updated oneDNN to v3.3 #2137

    • +
    +
  • +
+
+
+
+

2.0.100

+
+

Highlights

+
    +
  • Enhanced the functionality of Intel® Extension for PyTorch as a backend of torch.compile: #1568 #1585 #1590

  • +
  • Fixed the Stable Diffusion fine-tuning accuracy issue #1587 #1594

  • +
  • Fixed the ISA check on old hypervisor based VM #1513

  • +
  • Addressed the excessive memory usage in weight prepack #1593

  • +
  • Fixed the weight prepack of convolution when padding_mode is not 'zeros' #1580

  • +
  • Optimized the INT8 LSTM performance #1566

  • +
  • Fixed TransNetV2 calibration failure #1564

  • +
  • Fixed BF16 RNN-T inference when AVX512_CORE_VNNI ISA is used #1592

  • +
  • Fixed the ROIAlign operator #1589

  • +
  • Enabled execution on designated numa nodes with launch script #1517

  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v2.0.0+cpu...v2.0.100+cpu

+
+
+
+

2.0.0

+

We are pleased to announce the release of Intel® Extension for PyTorch* 2.0.0-cpu, which accompanies PyTorch 2.0. This release mainly brings in our latest optimization on NLP (BERT), support of PyTorch 2.0’s hero API, torch.compile, with the extension as one of its backends, together with a set of bug fixes and small optimizations.

+
+

Highlights

+
    +
  • Fast BERT optimization (Experimental): Intel introduced a new technique to speed up BERT workloads. Intel® Extension for PyTorch* integrated this implementation, which benefits the BERT model, especially training. A new API ipex.fast_bert is provided to try this new optimization. More detailed information can be found at Fast Bert Feature.
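
    A hedged sketch of trying the new API, assuming a Hugging Face BERT model is available (the transformers import and the model name are illustrative assumptions):

    import torch
    import intel_extension_for_pytorch as ipex
    from transformers import BertModel  # assumption: transformers is installed

    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()
    # Apply the Fast BERT optimization; BF16 is used here as an example dtype
    model = ipex.fast_bert(model, dtype=torch.bfloat16)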

  • +
  • MHA optimization with Flash Attention: Intel optimized the MHA module with the Flash Attention technique, inspired by the Stanford paper. This reduces memory consumption for LLMs and also provides better inference performance for models like BERT, Stable Diffusion, etc.

  • +
  • Work with torch.compile as a backend (Experimental): PyTorch 2.0 introduces a new feature, torch.compile, to speed up PyTorch execution. We’ve enabled Intel® Extension for PyTorch* as a backend of torch.compile, which can leverage this new PyTorch API’s power of graph capture and provide additional optimization based on these graphs. The usage of this new feature is quite simple, as below:

import torch
import intel_extension_for_pytorch as ipex
...
model = ipex.optimize(model)
model = torch.compile(model, backend='ipex')
    +
  • Bug fixing and other optimization

    +
      +
    • Supported RMSNorm which is widely used in the t5 model of huggingface #1341

    • +
    • Optimized InstanceNorm #1330

    • +
    • Fixed the quantization of LSTM #1414 #1473

    • +
    • Fixed the correctness issue of unpacking non-contiguous Linear weight #1419

    • +
    • oneDNN update #1488

    • +
    +
  • +
+
+
+

Known Issues

+

Please check at Known Issues webpage.

+
+
+
+

1.13.100

+
+

Highlights

+ +

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v1.13.0+cpu...v1.13.100+cpu

+
+
+
+

1.13.0

+

We are pleased to announce the release of Intel® Extension for PyTorch* 1.13.0-cpu which accompanies PyTorch 1.13. This release is highlighted with quite a few usability features which help users to get good performance and accuracy on CPU with less effort. We also added a couple of performance features as always. Check out the feature summary below.

+
    +
  • Usability Features

  • +
+
    +
  1. Automatic channels last format conversion: Channels last conversion is now applied automatically to PyTorch modules with ipex.optimize by default. Users don’t have to explicitly convert input and weight for CV models.

  2. +
  3. Code-free optimization (experimental): ipex.optimize is automatically applied to PyTorch modules without the need of code changes when the PyTorch program is started with the Intel® Extension for PyTorch* launcher via the new --auto-ipex option.

  4. +
  5. Graph capture mode of ipex.optimize (experimental): A new boolean flag graph_mode (default off) was added to ipex.optimize; when turned on, it converts the eager-mode PyTorch module into graph(s) to get the best of graph optimization.

  6. +
  7. INT8 quantization accuracy autotune (experimental): A new quantization API ipex.quantization.autotune was added to refine the default Intel® Extension for PyTorch* quantization recipe via autotuning algorithms for better accuracy.

  8. +
  9. Hypertune (experimental) is a new tool added on top of Intel® Extension for PyTorch* launcher to automatically identify the good configurations for best throughput via hyper-parameter tuning.

  10. +
  11. ipexrun: the counterpart of torchrun, a shortcut added for invoking the Intel® Extension for PyTorch* launcher.

  12. +
+
    +
  • Performance Features

  • +
+
    +
  1. Packed MKL SGEMM landed as the default kernel option for FP32 Linear, bringing up to a 20% geomean speedup for real-time NLP tasks.

  2. +
  3. DL compiler is now turned on by default with oneDNN fusion and gives additional performance boost for INT8 models.

  4. +
+
+

Highlights

+
    +
  • Automatic channels last format conversion: Channels last conversion is now applied to PyTorch modules automatically with ipex.optimize by default for both training and inference scenarios. Users don’t have to explicitly convert input and weight for CV models.

    import intel_extension_for_pytorch as ipex
    # No need to explicitly do format conversion
    # m = m.to(format=torch.channels_last)
    # x = x.to(format=torch.channels_last)
    # for inference
    m = ipex.optimize(m)
    m(x)
    # for training
    m, optimizer = ipex.optimize(m, optimizer=optimizer)
    m(x)
  • +
  • Code-free optimization (experimental): ipex.optimize is automatically applied to PyTorch modules without the need of code changes when the PyTorch program is started with the Intel® Extension for PyTorch* launcher via the new --auto-ipex option.

    +

    Example: QA case in HuggingFace

    # original command
    ipexrun --use_default_allocator --ninstance 2 --ncore_per_instance 28 run_qa.py \
      --model_name_or_path bert-base-uncased --dataset_name squad --do_eval \
      --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 \
      --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/

    # automatically apply bfloat16 optimization (--auto-ipex --dtype bfloat16)
    ipexrun --use_default_allocator --ninstance 2 --ncore_per_instance 28 --auto_ipex --dtype bfloat16 run_qa.py \
      --model_name_or_path bert-base-uncased --dataset_name squad --do_eval \
      --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 \
      --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
  • +
  • Graph capture mode of ipex.optimize (experimental): A new boolean flag graph_mode (default off) was added to ipex.optimize; when turned on, it converts the eager-mode PyTorch module into graph(s) to get the best of graph optimization. Under the hood, it combines the goodness of both TorchScript tracing and TorchDynamo to get as large a graph scope as possible. Currently, it only supports FP32 and BF16 inference. INT8 inference and training support are under way.

    import intel_extension_for_pytorch as ipex
    model = ...
    model.load_state_dict(torch.load(PATH))
    model.eval()
    optimized_model = ipex.optimize(model, graph_mode=True)
  • +
  • INT8 quantization accuracy autotune (experimental): A new quantization API ipex.quantization.autotune was added to refine the default Intel® Extension for PyTorch* quantization recipe via autotuning algorithms for better accuracy. This is an optional API to invoke (after prepare and before convert) for scenarios when the accuracy of default quantization recipe of Intel® Extension for PyTorch* cannot meet the requirement. The current implementation is powered by Intel® Neural Compressor.

    import intel_extension_for_pytorch as ipex
    # Calibrate the model
    qconfig = ipex.quantization.default_static_qconfig
    calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
    for data in calibration_data_set:
        calibrated_model(data)
    # Autotune the model
    calib_dataloader = torch.utils.data.DataLoader(...)
    def eval_func(model):
        # Return accuracy value
        ...
        return accuracy
    tuned_model = ipex.quantization.autotune(
                     calibrated_model, calib_dataloader, eval_func,
                     sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
                  )
    # Convert the model to jit model
    quantized_model = ipex.quantization.convert(tuned_model)
    with torch.no_grad():
        traced_model = torch.jit.trace(quantized_model, example_input)
        traced_model = torch.jit.freeze(traced_model)
    # Do inference
    y = traced_model(x)
  • +
  • Hypertune (experimental) is a new tool added on top of Intel® Extension for PyTorch* launcher to automatically identify the good configurations for best throughput via hyper-parameter tuning.

    python -m intel_extension_for_pytorch.cpu.launch.hypertune --conf_file <your_conf_file> <your_python_script> [args]
  • +
+
+
+

Known Issues

+

Please check at Known Issues webpage.

+
+
+
+

1.12.300

+
+

Highlights

+
    +
  • Optimize BF16 MHA fusion to avoid transpose overhead to boost BERT-* BF16 performance #992

  • +
  • Remove 64bytes alignment constraint for FP32 and BF16 AddLayerNorm fusion #992

  • +
  • Fix INT8 RetinaNet accuracy issue #1032

  • +
  • Fix Cat.out issue that does not update the out tensor (#1053) #1074

  • +
+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v1.12.100...v1.12.300

+
+
+
+

1.12.100

+

This is a patch release to fix the AVX2 issue that blocks running on non-AVX512 platforms.

+
+
+

1.12.0

+

We are excited to bring you the release of Intel® Extension for PyTorch* 1.12.0-cpu, tightly following the PyTorch 1.12 release. In this release, we matured the automatic INT8 quantization and made it a stable feature. We stabilized the runtime extension and brought a MultiStreamModule feature to further boost throughput in the offline inference scenario. We also brought various enhancements in operators and graph which are positive for the performance of a broad set of workloads.

+

Highlights include:

+
    +
  • Automatic INT8 quantization became a stable feature, baking in a well-tuned default quantization recipe and supporting both static and dynamic quantization and a wide range of calibration algorithms.

  • +
  • Runtime Extension, featuring MultiStreamModule, became a stable feature that can further enhance throughput in the offline inference scenario.

  • +
  • More optimizations in graph and operators to improve the performance of a broad set of models; examples include, but are not limited to, wav2vec, T5, Albert, etc.

  • +
  • The pre-built experimental binary with oneDNN Graph Compiler turned on delivers additional performance gains for Bert, Albert, and Roberta in INT8 inference.

  • +
+
+

Highlights

+
    +
  • Matured the automatic INT8 quantization feature, baking in a well-tuned default quantization recipe. We improved the user experience and provided a wide range of calibration algorithms like Histogram, MinMax, MovingAverageMinMax, etc. Meanwhile, we polished static quantization with better flexibility and enabled dynamic quantization as well. Compared to the previous version, the brief changes are as follows. Refer to the tutorial page for more details.

    v1.11.0-cpu:

    import intel_extension_for_pytorch as ipex
    # Calibrate the model
    qconfig = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
    for data in calibration_data_set:
        with ipex.quantization.calibrate(qconfig):
            model_to_be_calibrated(x)
    qconfig.save('qconfig.json')
    # Convert the model to jit model
    conf = ipex.quantization.QuantConf('qconfig.json')
    with torch.no_grad():
        traced_model = ipex.quantization.convert(model, conf, example_input)
    # Do inference
    y = traced_model(x)

    v1.12.0-cpu:

    import intel_extension_for_pytorch as ipex
    # Calibrate the model
    qconfig = ipex.quantization.default_static_qconfig # Histogram calibration algorithm and
    calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
    for data in calibration_data_set:
        calibrated_model(data)
    # Convert the model to jit model
    quantized_model = ipex.quantization.convert(calibrated_model)
    with torch.no_grad():
        traced_model = torch.jit.trace(quantized_model, example_input)
        traced_model = torch.jit.freeze(traced_model)
    # Do inference
    y = traced_model(x)
  • +
  • Runtime Extension, featuring MultiStreamModule, became a stable feature. In this release, we enhanced the heuristic rule to further improve throughput in the offline inference scenario. Meanwhile, we also provide the ipex.cpu.runtime.MultiStreamModuleHint to customize how to split the input into streams and concatenate the output of each stream.

    v1.11.0-cpu:

    import intel_extension_for_pytorch as ipex
    # Create CPU pool
    cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
    # Create multi-stream model
    multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(model, num_streams=2, cpu_pool=cpu_pool)

    v1.12.0-cpu:

    import intel_extension_for_pytorch as ipex
    # Create CPU pool
    cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
    # Optional
    multi_stream_input_hint = ipex.cpu.runtime.MultiStreamModuleHint(0)
    multi_stream_output_hint = ipex.cpu.runtime.MultiStreamModuleHint(0)
    # Create multi-stream model
    multi_Stream_model = ipex.cpu.runtime.MultiStreamModule(model, num_streams=2, cpu_pool=cpu_pool,
        input_split_hint=multi_stream_input_hint,      # optional
        output_concat_hint=multi_stream_output_hint)   # optional
  • +
  • Polished ipex.optimize to accept input shape information, which is used to deduce the optimal memory layout for better kernel efficiency.

    v1.11.0-cpu:

    import intel_extension_for_pytorch as ipex
    model = ...
    model.load_state_dict(torch.load(PATH))
    model.eval()
    optimized_model = ipex.optimize(model, dtype=torch.bfloat16)

    v1.12.0-cpu:

    import intel_extension_for_pytorch as ipex
    model = ...
    model.load_state_dict(torch.load(PATH))
    model.eval()
    optimized_model = ipex.optimize(model, dtype=torch.bfloat16, sample_input=input)
  • +
  • Provided more optimizations in graph and operations

    +
      +
    • Fuse Adam to improve training performance #822

    • +
    • Enable Normalization operators to support channels-last 3D #642

    • +
    • Support Deconv3D to serve most models and implement most fusions like Conv

    • +
    • Enable LSTM to support static and dynamic quantization #692

    • +
    • Enable Linear to support dynamic quantization #787

    • +
    • Fusions.

      +
        +
      • Fuse Add + Swish to accelerate FSI Riskful model #551

      • +
      • Fuse Conv + LeakyReLU #589

      • +
      • Fuse BMM + Add #407

      • +
      • Fuse Concat + BN + ReLU #647

      • +
      • Optimize Convolution1D to support channels last memory layout and fuse GeLU as its post operation. #657

      • +
      • Fuse Einsum + Add to boost Alphafold2 #674

      • +
      • Fuse Linear + Tanh #711

      • +
      +
    • +
    +
  • +
+
+
+

Known Issues

+
    +
  • RuntimeError: Overflow when unpacking long occurs when a tensor’s min/max value exceeds the int range while performing INT8 calibration. Please customize the QConfig to use the min-max calibration method.
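
    A hedged sketch of such a customized QConfig using the stock torch.ao.quantization min-max observers; the specific observer arguments and the model / example_inputs placeholders are assumptions:

    import torch
    import intel_extension_for_pytorch as ipex
    from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig

    # Use min-max observers instead of the default histogram-based activation observer
    qconfig = QConfig(
        activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
        weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
    )
    prepared_model = ipex.quantization.prepare(model, qconfig, example_inputs=example_inputs)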

  • +
  • When calibrating with quantize_per_tensor and benchmarking with 1 OpenMP* thread, results might be incorrect with large tensors (find more detailed info here). Editing your code following the pseudocode below can work around this issue if you do need to explicitly set OMP_NUM_THREADS=1 for benchmarking. However, there could be a performance regression if the oneDNN graph compiler prototype feature is utilized.

    +

    Workaround pseudocode:

    # perform convert/trace/freeze with omp_num_threads > 1 (N)
    torch.set_num_threads(N)
    prepared_model = prepare(model, input)
    converted_model = convert(prepared_model)
    traced_model = torch.jit.trace(converted_model, input)
    freezed_model = torch.jit.freeze(traced_model)
    # run freezed model to apply optimization pass
    freezed_model(input)

    # benchmarking with omp_num_threads = 1
    torch.set_num_threads(1)
    run_benchmark(freezed_model, input)
  • +
  • Low performance with INT8 support for dynamic shapes: The support for dynamic shapes in the Intel® Extension for PyTorch* INT8 integration is still work in progress. When the input shapes are dynamic, for example inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch* INT8 path may slow down the model inference. In this case, use stock PyTorch INT8 functionality. Note: when using the Runtime Extension feature, if the batch size cannot be divided evenly by the number of streams, the mini-batch sizes on the streams are not equivalent and scripts may run into this issue.
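
    For reference, one stock PyTorch path that avoids a fixed calibration shape is dynamic quantization; a minimal sketch (the choice of quantizing only Linear modules is an assumption):

    import torch

    # Quantize Linear layers dynamically with stock PyTorch; no fixed input shape is required
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )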

  • +
  • BF16 AMP(auto-mixed-precision) runs abnormally with the extension on the AVX2-only machine if the topology contains Conv, Matmul, Linear, and BatchNormalization

  • +
  • The runtime extension of MultiStreamModule doesn’t support DLRM inference, since the input of DLRM (EmbeddingBag specifically) can’t simply be batch split.

  • +
  • The runtime extension of MultiStreamModule has poor performance for RNN-T inference compared with the native throughput mode. Only part of the RNN-T model (joint_net specifically) can be JIT traced into a graph. However, in one batch inference, joint_net is invoked multiple times, which increases the overhead of MultiStreamModule for input batch split, thread synchronization, and output concatenation.

  • +
  • Incorrect Conv and Linear results if the number of OMP threads is changed at runtime: The oneDNN memory layout depends on the number of OMP threads, which requires the caller to detect changes in the number of OMP threads; this release has not implemented that yet.

  • +
  • Low throughput with DLRM FP32 training: A ‘Sparse Add’ PR is pending review. The issue will be fixed when the PR is merged.

  • +
  • If inference is done with a custom function, the conv+bn folding feature of the ipex.optimize() function doesn’t work.

    import torch
    import intel_pytorch_extension as ipex
    class Module(torch.nn.Module):
        def __init__(self):
            super(Module, self).__init__()
            self.conv = torch.nn.Conv2d(1, 10, 5, 1)
            self.bn = torch.nn.BatchNorm2d(10)
            self.relu = torch.nn.ReLU()
        def forward(self, x):
            x = self.conv(x)
            x = self.bn(x)
            x = self.relu(x)
            return x
        def inference(self, x):
            return self.forward(x)
    if __name__ == '__main__':
        m = Module()
        m.eval()
        m = ipex.optimize(m, dtype=torch.float32, level="O0")
        d = torch.rand(1, 1, 112, 112)
        with torch.no_grad():
            m.inference(d)

    This is a PyTorch FX limitation. You can avoid this error by calling m = ipex.optimize(m, level="O0"), which doesn’t apply ipex optimization, or disable conv+bn folding by calling m = ipex.optimize(m, level="O1", conv_bn_folding=False).

    +
  • +
+
+
+
+

1.11.200

+
+

Highlights

+
    +
  • Enable more fused operators to accelerate particular models.

  • +
  • Fuse Convolution and LeakyReLU (#648)

  • +
  • Support torch.einsum and fuse it with add (#684)

  • +
  • Fuse Linear and Tanh (#685)

  • +
  • In addition to the original installation methods, this release provides Docker installation from DockerHub.

  • +
  • Provided evaluation wheel packages that could boost performance for selected topologies on top of the oneDNN graph compiler prototype feature. NOTE: This is still at an early development stage and not fully mature yet, but feel free to reach out through GitHub issues if you have any suggestions.

  • +
+

Full Changelog

+
+
+
+

1.11.0

+

We are excited to announce Intel® Extension for PyTorch* 1.11.0-cpu release by tightly following PyTorch 1.11 release. Along with extension 1.11, we focused on continually improving OOB user experience and performance. Highlights include:

+
    +
  • Support a single binary with runtime dynamic dispatch based on AVX2/AVX512 hardware ISA detection

  • +
  • Support install binary from pip with package name only (without the need of specifying the URL)

  • +
  • Provide the C++ SDK installation to facilitate ease of C++ app development and deployment

  • +
  • Add more optimizations, including graph fusions for speeding up Transformer-based models and CNN, etc

  • +
  • Reduce the binary size for both the PIP wheel and C++ SDK (2X to 5X reduction from the previous version)

  • +
+
+

Highlights

+
    +
  • Combined the AVX2 and AVX512 binaries into a single binary that automatically dispatches to different implementations based on hardware ISA detection at runtime. The typical case is serving a data center that mixes AVX2-only and AVX512 platforms. Different ISA-specific binaries no longer need to be deployed, compared to the previous version.

    +

    NOTE: The extension uses the oneDNN library as the backend. However, the BF16 and INT8 operator sets and features are different between AVX2 and AVX512. Refer to oneDNN document for more details.

    +
    +

    When one input is of type u8, and the other one is of type s8, oneDNN assumes the user will choose the quantization parameters so no overflow/saturation occurs. For instance, a user can use u7 [0, 127] instead of u8 for the unsigned input, or s7 [-64, 63] instead of the s8 one. It is worth mentioning that this is required only when the Intel AVX2 or Intel AVX512 Instruction Set is used.

    +
    +
  • +
  • The extension wheel packages have been uploaded to pypi.org. The user could directly install the extension by pip/pip3 without explicitly specifying the binary location URL.

  • +
v1.10.100-cpu:

python -m pip install intel_extension_for_pytorch==1.10.100 -f https://software.intel.com/ipex-whl-stable

v1.11.0-cpu:

pip install intel_extension_for_pytorch
    +
  • Compared to the previous version, this release provides a dedicated installation file for the C++ SDK. The installation file automatically detects the PyTorch C++ SDK location and installs the extension C++ SDK files to the PyTorch C++ SDK. The user does not need to manually add the extension C++ SDK source files and CMake to the PyTorch SDK. In addition to that, the installation file reduces the C++ SDK binary size from ~220MB to ~13.5MB.

  • +
v1.10.100-cpu:

intel-ext-pt-cpu-libtorch-shared-with-deps-1.10.0+cpu.zip (220M)
intel-ext-pt-cpu-libtorch-cxx11-abi-shared-with-deps-1.10.0+cpu.zip (224M)

v1.11.0-cpu:

libintel-ext-pt-1.11.0+cpu.run (13.7M)
libintel-ext-pt-cxx11-abi-1.11.0+cpu.run (13.5M)
    +
  • Add more optimizations, including more custom operators and fusions.

    +
      +
    • Fuse the QKV linear operators as a single Linear to accelerate the Transformer*(BERT-*) encoder part - #278.

    • +
    • Remove Multi-Head-Attention fusion limitations to support the 64bytes unaligned tensor shape. #531

    • +
    • Fold the binary operator to Convolution and Linear operator to reduce computation. #432 #438 #602

    • +
    • Replace the out-of-place operators with their corresponding in-place versions to reduce the memory footprint. The extension currently supports operators including silu, sigmoid, tanh, hardsigmoid, hardswish, relu6, relu, selu, and softmax. #524

    • +
    • Fuse the Concat + BN + ReLU as a single operator. #452

    • +
    • Optimize Conv3D for both imperative and JIT by enabling NHWC and pre-packing the weight. #425

    • +
    +
  • +
  • Reduce the binary size. The C++ SDK is reduced from ~220MB to ~13.5MB while the wheel package is reduced from ~100MB to ~40MB.

  • +
  • Update oneDNN and oneDNN graph to 2.5.2 and 0.4.2 respectively.

  • +
+
+
+

What’s Changed

+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v1.10.100...v1.11.0

+
+
+
+

1.10.100

+

This release is meant to fix the following issues:

+
    +
  • Resolve the issue that the PyTorch Tensor Expression(TE) did not work after importing the extension.

  • +
  • Wrap BatchNorm (BN) as another operator to break TE’s BN-related fusions, because the BatchNorm implementation of PyTorch Tensor Expression cannot achieve the same performance as PyTorch ATen BN.

  • +
  • Update the documentation

    +
      +
    • Fix the INT8 quantization example issue #205

    • +
    • Polish the installation guide

    • +
    +
  • +
+
+
+

1.10.0

+

The Intel® Extension for PyTorch* 1.10 is on top of PyTorch 1.10. In this release, we polished the front-end APIs. The APIs are simpler, more stable, and more straightforward now. According to the PyTorch community recommendation, we changed the underlying device from XPU to CPU. With this change, the model and tensors do not need to be converted to the extension device to get performance improvements. It simplifies the model changes.

+

Besides that, we continuously optimize the Transformer* and CNN models by fusing more operators and applying NHWC. We measured the 1.10 performance on Torchvision and HuggingFace. As expected, 1.10 can speed up the two model zoos.

+
+

Highlights

+
    +
  • Changed the package name to intel_extension_for_pytorch from the original package name intel_pytorch_extension. This change is intended to avoid any potential legal issues.

  • +
v1.9.0-cpu:

import intel_pytorch_extension as ipex

v1.10.0-cpu:

import intel_extension_for_pytorch as ipex
    +
  • The underlying device is changed from the extension-specific device (XPU) to the standard CPU device, which aligns with the PyTorch CPU device design regardless of the dispatch mechanism and operator registration mechanism. This means the model does not need to be converted to the extension device explicitly.

  • +
v1.9.0-cpu:

import torch
import torchvision.models as models

# Import the extension
import intel_extension_for_pytorch as ipex

resnet18 = models.resnet18(pretrained = True)

# Explicitly convert the model to the extension device
resnet18_xpu = resnet18.to(ipex.DEVICE)

v1.10.0-cpu:

import torch
import torchvision.models as models

# Import the extension
import intel_extension_for_pytorch as ipex

resnet18 = models.resnet18(pretrained = True)
    +
  • Compared to v1.9.0, v1.10.0 follows PyTorch AMP API(torch.cpu.amp) to support auto-mixed-precision. torch.cpu.amp provides convenience for auto data type conversion at runtime. Currently, torch.cpu.amp only supports torch.bfloat16. It is the default lower precision floating point data type when torch.cpu.amp is enabled. torch.cpu.amp primarily benefits on Intel CPU with BFloat16 instruction set support.

  • +
import torch
class SimpleNet(torch.nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)

    def forward(self, x):
        return self.conv(x)
v1.9.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

# Automatically mix precision
ipex.enable_auto_mixed_precision(mixed_dtype = torch.bfloat16)

model = SimpleNet().eval()
x = torch.rand(64, 64, 224, 224)
with torch.no_grad():
    model = torch.jit.trace(model, x)
    model = torch.jit.freeze(model)
    y = model(x)

v1.10.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

model = SimpleNet().eval()
x = torch.rand(64, 64, 224, 224)
with torch.cpu.amp.autocast(), torch.no_grad():
    model = torch.jit.trace(model, x)
    model = torch.jit.freeze(model)
    y = model(x)
    +
  • The 1.10 release provides INT8 calibration as an experimental feature, while it only supports post-training static quantization now. Compared to 1.9.0, the frontend APIs for quantization are more straightforward and easier to use.

  • +
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv = nn.Conv2d(10, 10, 3)

    def forward(self, x):
        x = self.conv(x)
        return x

model = MyModel().eval()

# user dataset for calibration.
xx_c = [torch.randn(1, 10, 28, 28) for i in range(2)]
# user dataset for validation.
xx_v = [torch.randn(1, 10, 28, 28) for i in range(20)]
    +
  • Calibration

  • +
v1.9.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

# Convert the model to the Extension device
model = Model().to(ipex.DEVICE)

# Create a configuration file to save quantization parameters.
conf = ipex.AmpConf(torch.int8)
with torch.no_grad():
    for x in xx_c:
        # Run the model under calibration mode to collect quantization parameters
        with ipex.AutoMixPrecision(conf, running_mode='calibration'):
            y = model(x.to(ipex.DEVICE))
# Save the configuration file
conf.save('configure.json')

v1.10.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
with torch.no_grad():
    for x in xx_c:
        with ipex.quantization.calibrate(conf):
            y = model(x)

conf.save('configure.json')
    +
  • Inference

  • +
v1.9.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

# Convert the model to the Extension device
model = Model().to(ipex.DEVICE)
conf = ipex.AmpConf(torch.int8, 'configure.json')
with torch.no_grad():
    for x in cali_dataset:
        with ipex.AutoMixPrecision(conf, running_mode='inference'):
            y = model(x.to(ipex.DEVICE))

v1.10.0-cpu:

# Import the extension
import intel_extension_for_pytorch as ipex

conf = ipex.quantization.QuantConf('configure.json')

with torch.no_grad():
    trace_model = ipex.quantization.convert(model, conf, example_input)
    for x in xx_v:
        y = trace_model(x)
    +
  • This release introduces the optimize API at the Python front end to optimize the model and optimizer for training. The new API supports both FP32 and BF16, for both inference and training.
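
A minimal sketch of the training usage, reusing the SimpleNet module defined above; the SGD optimizer choice is an assumption:

import torch
import intel_extension_for_pytorch as ipex

model = SimpleNet().train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Optimize both the model and the optimizer; dtype=torch.bfloat16 enables BF16 training
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)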

  • +
  • Runtime Extension (Experimental) provides a runtime CPU pool API to bind threads to cores. It also features async tasks. Note: Intel® Extension for PyTorch* Runtime extension is still in the experimental stage. The API is subject to change. More detailed descriptions are available in the extension documentation.
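
A hedged sketch of the thread-binding usage, reusing the SimpleNet module above; treating ipex.cpu.runtime.pin as the binding context manager is an assumption about this experimental API:

import torch
import intel_extension_for_pytorch as ipex

model = SimpleNet().eval()
x = torch.rand(64, 64, 224, 224)
# Bind the following execution to the cores of NUMA node 0 (CPUPool usage as shown elsewhere in these notes)
cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
with ipex.cpu.runtime.pin(cpu_pool):
    y = model(x)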

  • +
+
+
+

Known Issues

+
    +
  • The omp_set_num_threads function fails to change the number of OpenMP threads used by oneDNN operators if it was set before.

    +

    omp_set_num_threads function is provided in Intel® Extension for PyTorch* to change the number of threads used with OpenMP. However, it failed to change the number of OpenMP threads if it was set before.

    +

    pseudo-code:

    omp_set_num_threads(6)
    model_execution()
    omp_set_num_threads(4)
    same_model_execution_again()

    Reason: the oneDNN primitive descriptor stores the OMP number of threads. The current oneDNN integration caches the primitive descriptor in IPEX, so if we use the runtime extension with oneDNN-based PyTorch/IPEX operations, the runtime extension fails to change the OMP number of threads being used.

    +
  • +
  • Low performance with INT8 support for dynamic shapes

    +

    The support for dynamic shapes in Intel® Extension for PyTorch* INT8 integration is still work in progress. When the input shapes are dynamic, for example, inputs of variable image sizes in an object detection task or of variable sequence lengths in NLP tasks, the Intel® Extension for PyTorch* INT8 path may slow down the model inference. In this case, use stock PyTorch INT8 functionality.

    +
  • +
  • Low throughput with DLRM FP32 Train

    +

    A ‘Sparse Add’ PR is pending review. The issue will be fixed when the PR is merged.

    +
  • +
+
+
+

What’s Changed

+

Full Changelog: https://github.com/intel/intel-extension-for-pytorch/compare/v1.9.0...v1.10.0+cpu-rc3

+
+
+
+

1.9.0

+
+

What’s New

+
    +
  • Rebased the Intel Extension for Pytorch from PyTorch-1.8.0 to the official PyTorch-1.9.0 release.

  • +
  • Support binary installation.

    +

    python -m pip install torch_ipex==1.9.0 -f https://software.intel.com/ipex-whl-stable

    +
  • +
  • Support the C++ library. The third party App can link the Intel-Extension-for-PyTorch C++ library to enable the particular optimizations.

  • +
+
+
+
+

1.8.0

+
+

What’s New

+
    +
  • Rebased the Intel Extension for PyTorch from PyTorch 1.7.0 to the official PyTorch 1.8.0 release. The new XPU device type has been added into PyTorch 1.8.0 (#49786), so there is no need to patch PyTorch to enable the Intel Extension for PyTorch anymore.

  • +
  • Upgraded the oneDNN from v1.5-rc to v1.8.1

  • +
  • Updated the README file to add the sections to introduce supported customized operators, supported fusion patterns, tutorials, and joint blogs with stakeholders

  • +
+
+
+
+

1.2.0

+
+

What’s New

+
    +
  • We rebased the Intel Extension for PyTorch from PyTorch 1.5-rc3 to the official PyTorch 1.7.0 release. It brings performance improvements with the new PyTorch 1.7 support.

  • +
  • Device name was changed from DPCPP to XPU.

    +

    We changed the device name from DPCPP to XPU to align with the future Intel GPU product for heterogeneous computation.

    +
  • +
  • Enabled the launcher for end users.

  • +
  • We enabled the launch script that helps users launch the program for training and inference, and automatically set up the strategy for multi-threading, multi-instance execution, and the memory allocator. Refer to the launch script comments for more details.

  • +
+
+
+

Performance Improvement

+
    +
  • This upgrade provides better INT8 optimization with refined auto mixed-precision API.

  • +
  • More operators are optimized for the INT8 inference and BF16 training of some key workloads, like MaskRCNN, SSD-ResNet34, DLRM, and RNN-T.

  • +
+
+
+

Others

+
    +
  • Bug fixes

    +
      +
    • This upgrade fixes the issue that saving the model trained by Intel extension for PyTorch caused errors.

    • +
    • This upgrade fixes the issue that Intel extension for PyTorch was slower than pytorch proper for Tacotron2.

    • +
    +
  • +
  • New custom operators

    +

    This upgrade adds several custom operators: ROIAlign, RNN, FrozenBatchNorm, nms.

    +
  • +
  • Optimized operators/fusion

    +

    This upgrade optimizes several operators: tanh, log_softmax, upsample, and embedding_bag, and enables INT8 Linear fusion.

    +
  • +
  • Performance

    +

    The release has daily automated testing for the supported models: ResNet50, ResNext101, Huggingface Bert, DLRM, Resnext3d, MaskRCNN, SSD-ResNet34. With the extension imported, it can bring up to 2x INT8 over FP32 inference performance improvements on the 3rd Gen Intel Xeon Scalable processors (formerly codenamed Cooper Lake).

    +
  • +
+
+
+

Known issues

+
    +
  • Multi-node training still encounters hang issues after several iterations. The fix will be included in the next official release.

  • +
+
+
+
+

1.1.0

+
+

What’s New

+
    +
  • Added optimization for training with FP32 data type & BF16 data type. All the optimized FP32/BF16 backward operators include:

    +
      +
    • Conv2d

    • +
    • Relu

    • +
    • Gelu

    • +
    • Linear

    • +
    • Pooling

    • +
    • BatchNorm

    • +
    • LayerNorm

    • +
    • Cat

    • +
    • Softmax

    • +
    • Sigmoid

    • +
    • Split

    • +
    • Embedding_bag

    • +
    • Interaction

    • +
    • MLP

    • +
    +
  • +
  • More fusion patterns are supported and validated in the release, see table:

  • +
Fusion Patterns      Release
Conv + Sum           v1.0
Conv + BN            v1.0
Conv + Relu          v1.0
Linear + Relu        v1.0
Conv + Eltwise       v1.1
Linear + Gelu        v1.1
    +
  • Add docker support

  • +
  • [Alpha] Multi-node training with oneCCL support.

  • +
  • [Alpha] INT8 inference optimization.

  • +
+
+
+

Performance

+
    +
  • The release has daily automated testing for the supported models: ResNet50, ResNext101, Huggingface Bert, DLRM, Resnext3d, Transformer. With the extension imported, it can bring up to 1.2x~1.7x BF16 over FP32 training performance improvements on the 3rd Gen Intel Xeon scalable processors (formerly codename Cooper Lake).

  • +
+
+
+

Known issue

+
    +
  • Some workloads may crash after several iterations on the extension with jemalloc enabled.

  • +
+
+
+
+

1.0.2

+
    +
  • Rebase torch CCL patch to PyTorch 1.5.0-rc3

  • +
+
+
+

1.0.1-Alpha

+
    +
  • Static link oneDNN library

  • +
  • Check AVX512 build option

  • +
  • Fix the issue that cannot normally invoke enable_auto_optimization

  • +
+
+
+

1.0.0-Alpha

+
+

What’s New

+
    +
  • Auto Operator Optimization

    +

    Intel Extension for PyTorch will automatically optimize the operators of PyTorch when importing its python package. It will significantly improve the computation performance if the input tensor and the model is converted to the extension device.

    +
  • +
  • Auto Mixed Precision: Currently, the extension supports bfloat16. It streamlines the work to enable a bfloat16 model. The feature is controlled by enable_auto_mix_precision. If you enable it, the extension will run the operators with bfloat16 automatically to accelerate the computation.

  • +
+
+
+

Performance Result

+

We collected the performance data of some models on the Intel Cooper Lake platform with 1 socket and 28 cores. Intel Cooper Lake introduced AVX512 BF16 instructions that could improve the bfloat16 computation significantly. The detail is as follows (The data is the speedup ratio and the baseline is upstream PyTorch).

Model        Imperative - Operator Injection    Imperative - Mixed Precision    JIT - Operator Injection    JIT - Mixed Precision
RN50         2.68                               5.01                            5.14                        9.66
ResNet3D     3.00                               4.67                            5.19                        8.39
BERT-LARGE   0.99                               1.40                            N/A                         N/A

We also measured the performance of ResNeXt101, Transformer-FB, DLRM, and YOLOv3 with the extension. We observed that the performance could be significantly improved by the extension as expected.

+
+
+

Known issue

+
    +
  • #10 All data types have not been registered for DPCPP

  • +
  • #37 MaxPool can’t get nan result when input’s value is nan

  • +
+
+
+

NOTE

+

The extension supported PyTorch v1.5.0-rc3. Support for other PyTorch versions is work in progress.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.