diff --git a/xpu/2.3.110+xpu/.buildinfo b/xpu/2.3.110+xpu/.buildinfo
new file mode 100644
index 000000000..43dcfab23
--- /dev/null
+++ b/xpu/2.3.110+xpu/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: e758712e5f73bcd66385e41feaf1659f
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/xpu/2.3.110+xpu/_images/compute_eng_arc.png b/xpu/2.3.110+xpu/_images/compute_eng_arc.png
new file mode 100644
index 000000000..2a13fa704
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/compute_eng_arc.png differ
diff --git a/xpu/2.3.110+xpu/_images/figure1_DLPack_import.svg b/xpu/2.3.110+xpu/_images/figure1_DLPack_import.svg
new file mode 100644
index 000000000..6276f2221
--- /dev/null
+++ b/xpu/2.3.110+xpu/_images/figure1_DLPack_import.svg
@@ -0,0 +1,746 @@
+[SVG markup omitted: DLPack import diagram showing the PyTorchTensor fields (sizes, strides, dtype, device, data_ptr, deleter, dim) and which parts are handled by PyTorch versus by Intel® Extension for PyTorch*]
diff --git a/xpu/2.3.110+xpu/_images/figure1_memory_layout.png b/xpu/2.3.110+xpu/_images/figure1_memory_layout.png
new file mode 100644
index 000000000..b37006bbc
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/figure1_memory_layout.png differ
diff --git a/xpu/2.3.110+xpu/_images/figure2_DLPack_export.svg b/xpu/2.3.110+xpu/_images/figure2_DLPack_export.svg
new file mode 100644
index 000000000..6d0f80a63
--- /dev/null
+++ b/xpu/2.3.110+xpu/_images/figure2_DLPack_export.svg
@@ -0,0 +1,876 @@
+[SVG markup omitted: DLPack export diagram showing the PyTorchTensor fields (sizes, strides, dtype, device, data_ptr, deleter, dim) wrapped by an ATenDLMTensor/DLManagedTensor, and which parts are handled by PyTorch versus by Intel® Extension for PyTorch*]
diff --git a/xpu/2.3.110+xpu/_images/figure2_dispatch.png b/xpu/2.3.110+xpu/_images/figure2_dispatch.png
new file mode 100644
index 000000000..b6f56c1b4
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/figure2_dispatch.png differ
diff --git a/xpu/2.3.110+xpu/_images/figure3_strided_layout.png b/xpu/2.3.110+xpu/_images/figure3_strided_layout.png
new file mode 100644
index 000000000..fc7491254
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/figure3_strided_layout.png differ
diff --git a/xpu/2.3.110+xpu/_images/four_card.png b/xpu/2.3.110+xpu/_images/four_card.png
new file mode 100644
index 000000000..9ebd723ca
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/four_card.png differ
diff --git a/xpu/2.3.110+xpu/_images/intel_extension_for_pytorch_structure.png b/xpu/2.3.110+xpu/_images/intel_extension_for_pytorch_structure.png
new file mode 100644
index 000000000..7b3dd6b32
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/intel_extension_for_pytorch_structure.png differ
diff --git a/xpu/2.3.110+xpu/_images/llm_iakv_2.png b/xpu/2.3.110+xpu/_images/llm_iakv_2.png
new file mode 100644
index 000000000..3e3d78a17
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/llm_iakv_2.png differ
diff --git a/xpu/2.3.110+xpu/_images/llm_kvcache.png b/xpu/2.3.110+xpu/_images/llm_kvcache.png
new file mode 100644
index 000000000..fbd098c8b
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/llm_kvcache.png differ
diff --git a/xpu/2.3.110+xpu/_images/profiler_kineto_result_console.png b/xpu/2.3.110+xpu/_images/profiler_kineto_result_console.png
new file mode 100644
index 000000000..8d2c26986
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/profiler_kineto_result_console.png differ
diff --git a/xpu/2.3.110+xpu/_images/profiler_kineto_result_perfetto_viewer.png b/xpu/2.3.110+xpu/_images/profiler_kineto_result_perfetto_viewer.png
new file mode 100644
index 000000000..2d34416fd
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/profiler_kineto_result_perfetto_viewer.png differ
diff --git a/xpu/2.3.110+xpu/_images/single_card.png b/xpu/2.3.110+xpu/_images/single_card.png
new file mode 100644
index 000000000..4df7f848c
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/single_card.png differ
diff --git a/xpu/2.3.110+xpu/_images/single_tile.png b/xpu/2.3.110+xpu/_images/single_tile.png
new file mode 100644
index 000000000..c4b763707
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/single_tile.png differ
diff --git a/xpu/2.3.110+xpu/_images/two_card.png b/xpu/2.3.110+xpu/_images/two_card.png
new file mode 100644
index 000000000..670db172d
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/two_card.png differ
diff --git a/xpu/2.3.110+xpu/_images/weight-only-quantization-flow.png b/xpu/2.3.110+xpu/_images/weight-only-quantization-flow.png
new file mode 100644
index 000000000..86d9e732e
Binary files /dev/null and b/xpu/2.3.110+xpu/_images/weight-only-quantization-flow.png differ
diff --git a/xpu/2.3.110+xpu/_sources/design_doc/cpu/isa_dyndisp.md.txt b/xpu/2.3.110+xpu/_sources/design_doc/cpu/isa_dyndisp.md.txt
new file mode 100644
index 000000000..20c7931a8
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/design_doc/cpu/isa_dyndisp.md.txt
@@ -0,0 +1,481 @@
+# Intel® Extension for PyTorch\* CPU ISA Dynamic Dispatch Design Doc
+
+This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch\* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.
+
+## Overview
+
+IPEX dyndisp is forked from **PyTorch:** `ATen/native/DispatchStub.h` and `ATen/native/DispatchStub.cpp`. IPEX adds additional CPU ISA level support, such as `AVX512_VNNI`, `AVX512_BF16` and `AMX`.
+
+PyTorch & IPEX CPU ISA support statement:
+
+ | | DEFAULT | AVX2 | AVX2_VNNI | AVX512 | AVX512_VNNI | AVX512_BF16 | AMX | AVX512_FP16 |
+ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+ | PyTorch | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ |
+ | IPEX-1.11 | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ |
+ | IPEX-1.12 | ✘ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ |
+ | IPEX-1.13 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ |
+ | IPEX-2.1 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
+ | IPEX-2.2 | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
+
+\* The current IPEX DEFAULT level is implemented the same as the AVX2 level.
+
+### CPU ISA build compiler requirement
+ | ISA Level | GCC requirement |
+ | ---- | ---- |
+ | AVX2 | Any |
+ | AVX512 | GCC 9.2+ |
+ | AVX512_VNNI | GCC 9.2+ |
+ | AVX512_BF16 | GCC 10.3+ |
+ | AVX2_VNNI | GCC 11.2+ |
+ | AMX | GCC 11.2+ |
+ | AVX512_FP16 | GCC 12.1+ |
+
+\* Check `cmake/Modules/FindAVX.cmake` for the detailed compiler checks.
+
+## Dynamic Dispatch Design
+
+Dynamic dispatch copies the kernel implementation source files into multiple folders, one per ISA level, and then builds each copy with that level's ISA-specific parameters. Each generated object file contains its own function body (**Kernel Implementation**).
+
+The Kernel Implementation uses an anonymous namespace so that the different CPU versions won't conflict.
+
+The **Kernel Stub** is a "virtual function" with polymorphic kernel implementations pertaining to ISA levels.
+
+At runtime, the **Dispatch Stub implementation** checks CPUIDs and OS status to determine which ISA level's function pointer best matches the machine.
+
+### Code Folder Structure
+>#### **Kernel implementation:** `csrc/cpu/aten/kernels/xyzKrnl.cpp`
+>#### **Kernel Stub:** `csrc/cpu/aten/xyz.cpp` and `csrc/cpu/aten/xyz.h`
+>#### **Dispatch Stub implementation:** `csrc/cpu/dyndisp/DispatchStub.cpp` and `csrc/cpu/dyndisp/DispatchStub.h`
+
+### CodeGen Process
+The IPEX build system generates code for each ISA level with specific compiler parameters. The CodeGen script is located at `cmake/cpu/IsaCodegen.cmake`.
+
+The CodeGen copies each cpp file from the **Kernel implementation** folder and appends the ISA level to the file name as a new suffix.
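+
+Before looking at the generated files below, here is a minimal, self-contained sketch of the runtime selection idea described in the Dynamic Dispatch Design section: a per-ISA table of function pointers from which the best supported entry is picked on the running CPU. This is illustrative only and is not the actual `DispatchStub` implementation (the real macros appear later in this document); the `Capability` enum, `ExampleStub` struct, and `kernel_*` functions are assumptions made for this sketch, and GCC's `__builtin_cpu_supports` stands in for the real CPUID/OS checks.
+
+```c++
+#include <array>
+#include <cstdio>
+
+// Illustrative ISA levels, ordered from lowest to highest.
+enum class Capability { DEFAULT = 0, AVX2 = 1, AVX512 = 2 };
+
+// One kernel signature; one table slot per ISA level.
+using kernel_fn = void (*)(const float* src, float* dst, int len);
+
+void kernel_default(const float*, float*, int) { std::puts("DEFAULT kernel"); }
+void kernel_avx2(const float*, float*, int) { std::puts("AVX2 kernel"); }
+void kernel_avx512(const float*, float*, int) { std::puts("AVX512 kernel"); }
+
+struct ExampleStub {
+  // Slots are filled by per-ISA registration; unimplemented levels stay null.
+  std::array<kernel_fn, 3> table{kernel_default, kernel_avx2, kernel_avx512};
+
+  // Pick the highest registered kernel that the current CPU supports.
+  kernel_fn choose() const {
+    if (__builtin_cpu_supports("avx512f") && table[static_cast<int>(Capability::AVX512)])
+      return table[static_cast<int>(Capability::AVX512)];
+    if (__builtin_cpu_supports("avx2") && table[static_cast<int>(Capability::AVX2)])
+      return table[static_cast<int>(Capability::AVX2)];
+    return table[static_cast<int>(Capability::DEFAULT)];
+  }
+};
+
+int main() {
+  ExampleStub stub;
+  float src[4] = {1, 2, 3, 4}, dst[4];
+  stub.choose()(src, dst, 4);  // dispatches to the best available implementation
+  return 0;
+}
+```
+
+The real dispatcher additionally checks OS support and honors the `ATEN_CPU_CAPABILITY` override described later in this document. The sample below lists the files and compiler parameters that the CodeGen step actually produces.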
+
+> **Sample:**
+>
+> ----
+>
+> **Origin file:**
+>
+> `csrc/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp`
+>
+> **Generated files:**
+>
+> DEFAULT: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.DEFAULT.cpp -O3 -D__AVX__ -DCPU_CAPABILITY_AVX2 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=DEFAULT -DCPU_CAPABILITY_DEFAULT`
+>
+> AVX2: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX2.cpp -O3 -D__AVX__ -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2`
+>
+> AVX512: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512.cpp -O3 -D__AVX512F__ -mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512`
+>
+> AVX512_VNNI: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_VNNI.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mfma -DCPU_CAPABILITY=AVX512_VNNI -DCPU_CAPABILITY_AVX512_VNNI`
+>
+> AVX512_BF16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16`
+>
+> AMX: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX`
+>
+> AVX512_FP16: `build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_FP16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -mavx512fp16 -DCPU_CAPABILITY_AMX -DCPU_CAPABILITY=AVX512_FP16 -DCPU_CAPABILITY_AVX512_FP16`
+---
+
+>**Note:**
+>
+>1. DEFAULT level kernels are not fully implemented in IPEX. To align with PyTorch, the DEFAULT level is built with AVX2 parameters instead. Therefore, the minimal requirement for a machine running IPEX is AVX2 support.
+>2. `-D__AVX__` and `-D__AVX512F__` are defined for the dependency library [sleef](https://sleef.org/).
+>3. `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX2` must be defined for **PyTorch:** `aten/src/ATen/cpu/vec`; they determine the vec register width.
+>4. `-DCPU_CAPABILITY=[ISA_NAME]` must be defined for **PyTorch:** `aten/src/ATen/cpu/vec`; it is used as the inline namespace name.
+>5. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain `-DCPU_CAPABILITY_AVX512` and `-DCPU_CAPABILITY_AVX512_VNNI`. However, AVX512 does not contain the AVX2 definitions, because the two use different vec register widths.
+
+## Add Custom Kernel
+
+If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:
+
+1. Add the CPU ISA related kernel implementation to the folder: `csrc/cpu/aten/kernels/NewKernelKrnl.cpp`
+2. Add the kernel stub to the folder: `csrc/cpu/aten/NewKernel.cpp`
+3. Include the header file `csrc/cpu/dyndisp/DispatchStub.h`, and refer to the comment in the header file.
+```c++
+// Implements instruction set specific function dispatch.
+//
+// Kernels that may make use of specialized instruction sets (e.g. AVX2) are
+// compiled multiple times with different compiler flags (e.g. -mavx2). A
+// DispatchStub contains a table of function pointers for a kernel. At runtime,
+// the fastest available kernel is chosen based on the features reported by
+// cpuinfo.
+//
+// Example:
+//
+// In csrc/cpu/aten/MyKernel.h:
+// using fn_type = void(*)(const Tensor& x);
+// IPEX_DECLARE_DISPATCH(fn_type, stub);
+//
+// In csrc/cpu/aten/MyKernel.cpp
+// IPEX_DEFINE_DISPATCH(stub);
+//
+// In csrc/cpu/aten/kernels/MyKernel.cpp:
+// namespace {
+// // use anonymous namespace so that different cpu versions won't conflict
+// void kernel(const Tensor& x) { ... }
+// }
+// IPEX_REGISTER_DISPATCH(stub, &kernel);
+//
+// To call:
+// stub(kCPU, tensor);
+```
+4. Write the kernel following the guide. This includes declaring the function type, registering the stub, calling the stub, and so on.
+
+>**Note:**
+>
+>1. Some kernels only call the **oneDNN** or **iDeep** implementation, or another backend implementation, and do not need a kernel implementation of their own. (Refer to `BatchNorm.cpp`.)
+>2. Vec-related header files must be included in kernel implementation files, but cannot be included in the kernel stub. The kernel stub is common code for all ISA levels and cannot be compiled with ISA-related compiler parameters.
+>3. For more intrinsics, check the [Intel® Intrinsics Guide](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).
+
+### ISA intrinsics specific kernel example:
+
+This is an FP32-to-BF16 conversion function example, implemented for the `AVX512_BF16`, `AVX512` and `DEFAULT` ISA levels.
+
+```c++
+//csrc/cpu/aten/CvtFp32ToBf16.h
+
+#pragma once
+
+#include "csrc/cpu/dyndisp/DispatchStub.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len);
+
+namespace {
+
+void cvt_fp32_to_bf16_kernel_impl(at::BFloat16* dst, const float* src, int len);
+
+}
+
+using cvt_fp32_to_bf16_kernel_fn = void (*)(at::BFloat16*, const float*, int);
+IPEX_DECLARE_DISPATCH(cvt_fp32_to_bf16_kernel_fn, cvt_fp32_to_bf16_kernel_stub);
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+```c++
+//csrc/cpu/aten/CvtFp32ToBf16.cpp
+
+#include "CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(cvt_fp32_to_bf16_kernel_stub);
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
+  return cvt_fp32_to_bf16_kernel_stub(kCPU, dst, src, len);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+The macros `CPU_CAPABILITY_AVX512` and `CPU_CAPABILITY_AVX512_BF16` are defined by compiler checks; they mean that the current compiler is capable of generating code for the corresponding ISA level.
+
+Because `AVX512_BF16` is a higher level than `AVX512` and is compatible with it, the `CPU_CAPABILITY_AVX512_BF16` code can be contained in the `CPU_CAPABILITY_AVX512` region.
+```c++
+//csrc/cpu/aten/kernels/CvtFp32ToBf16Krnl.cpp
+
+#include <ATen/ATen.h>
+#include "csrc/aten/cpu/CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <ATen/cpu/vec/vec512/vec512.h>
+#else
+#include <ATen/cpu/vec/vec256/vec256.h>
+#endif
+using namespace at::vec;
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <immintrin.h>
+
+inline __m256i _cvt_fp32_to_bf16(const __m512 src) {
+#if (defined CPU_CAPABILITY_AVX512_BF16) // AVX512_BF16 ISA implementation.
+  return reinterpret_cast<__m256i>(_mm512_cvtneps_pbh(src));
+#else // AVX512 ISA implementation.
+  __m512i value = _mm512_castps_si512(src);
+  __m512i nan = _mm512_set1_epi32(0xffff);
+  auto mask_value = _mm512_cmp_ps_mask(src, src, _CMP_ORD_Q);
+  __m512i ones = _mm512_set1_epi32(0x1);
+  __m512i vec_bias = _mm512_set1_epi32(0x7fff);
+  // uint32_t lsb = (input >> 16) & 1;
+  auto t_value = _mm512_and_si512(_mm512_srli_epi32(value, 16), ones);
+  // uint32_t rounding_bias = 0x7fff + lsb;
+  t_value = _mm512_add_epi32(t_value, vec_bias);
+  // input += rounding_bias;
+  t_value = _mm512_add_epi32(t_value, value);
+  // input = input >> 16;
+  t_value = _mm512_srli_epi32(t_value, 16);
+  // Check NaN before converting back to bf16
+  t_value = _mm512_mask_blend_epi32(mask_value, nan, t_value);
+  return _mm512_cvtusepi32_epi16(t_value);
+#endif
+}
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  int i = 0;
+  for (; i < len - 15; i += 16) {
+    auto f32 = _mm512_loadu_ps(src + i);
+    _mm256_storeu_si256((__m256i*)(dst + i), _cvt_fp32_to_bf16(f32));
+  }
+  if (i < len) {
+    auto mask = (1 << (len - i)) - 1;
+    auto f32 = _mm512_maskz_loadu_ps(mask, src + i);
+    _mm256_mask_storeu_epi16(dst + i, mask, _cvt_fp32_to_bf16(f32));
+  }
+}
+
+#else // DEFAULT ISA implementation.
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  for (int j = 0; j < len; j++) {
+    *(dst + j) = *(src + j);
+  }
+}
+
+#endif
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+
+### Vec specific kernel example:
+This example shows how to get the size of a data type and its corresponding Vec size. Under different ISAs, Vec has a different register width and therefore a different Vec size.
+
+```c++
+//csrc/cpu/aten/GetVecLength.h
+#pragma once
+
+#include "csrc/cpu/dyndisp/DispatchStub.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+std::tuple<int64_t, int64_t> get_cpp_typesize_and_vecsize(at::ScalarType dtype);
+
+namespace {
+
+std::tuple<int64_t, int64_t> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype);
+}
+
+using get_cpp_typesize_and_vecsize_kernel_fn =
+    std::tuple<int64_t, int64_t> (*)(at::ScalarType);
+IPEX_DECLARE_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_fn,
+    get_cpp_typesize_and_vecsize_kernel_stub);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+
+```c++
+//csrc/cpu/aten/GetVecLength.cpp
+
+#include "GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(get_cpp_typesize_and_vecsize_kernel_stub);
+
+// get cpp typesize and vectorsize by at::ScalarType
+std::tuple<int64_t, int64_t> get_cpp_typesize_and_vecsize(at::ScalarType dtype) {
+  return get_cpp_typesize_and_vecsize_kernel_stub(kCPU, dtype);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+
+```c++
+//csrc/cpu/aten/kernels/GetVecLengthKrnl.cpp
+
+#include <ATen/cpu/vec/vec.h>
+#include "csrc/cpu/aten/GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+std::tuple<int64_t, int64_t> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype) {
+  switch (dtype) {
+    case at::ScalarType::Double:
+      return std::make_tuple(
+          sizeof(double), at::vec::Vectorized<double>::size());
+    case at::ScalarType::Float:
+      return std::make_tuple(sizeof(float), at::vec::Vectorized<float>::size());
+    case at::ScalarType::ComplexDouble:
+      return std::make_tuple(
+          sizeof(c10::complex<double>),
+          at::vec::Vectorized<c10::complex<double>>::size());
+    case at::ScalarType::ComplexFloat:
+      return std::make_tuple(
+          sizeof(c10::complex<float>),
+          at::vec::Vectorized<c10::complex<float>>::size());
+    case at::ScalarType::BFloat16:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::BFloat16>::t)),
+          at::vec::Vectorized<decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::BFloat16>::t)>::size());
+    case at::ScalarType::Half:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::Half>::t)),
+          at::vec::Vectorized<decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::Half>::t)>::size());
+    default:
+      TORCH_CHECK(
+          false,
+          "Currently only floating and complex ScalarType are supported.");
+  }
+}
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_stub,
+    &get_cpp_typesize_and_vecsize_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+```
+## Private Debug APIs
+
+Here are three ISA-related private APIs that can help with debugging:
+1. Query the current ISA level.
+2. Query the max CPU supported ISA level.
+3. Query the max binary supported ISA level.
+>**Note:**
+>
+>1. The max CPU supported ISA level only depends on CPU features.
+>2. The max binary supported ISA level only depends on the compiler version used to build the binary.
+>3. The current ISA level is the smaller of `max CPU ISA level` and `max binary ISA level`.
+
+### Example:
+```bash
+python
+Python 3.9.7 (default, Sep 16 2021, 13:09:58)
+[GCC 7.5.0] :: Anaconda, Inc. on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import intel_extension_for_pytorch._C as core
+>>> core._get_current_isa_level()
+'AMX'
+>>> core._get_highest_cpu_support_isa_level()
+'AMX'
+>>> core._get_highest_binary_support_isa_level()
+'AMX'
+>>> quit()
+```
+
+## Select ISA level manually
+
+By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable `ATEN_CPU_CAPABILITY` (the same environment variable as PyTorch). The available values are {`avx2`, `avx512`, `avx512_vnni`, `avx512_bf16`, `amx`, `avx512_fp16`}. The effective ISA level is the minimum of `ATEN_CPU_CAPABILITY` and the maximum level supported by the hardware.
+
+### Example:
+```bash
+$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AMX
+$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AVX2
+```
+>**Note:**
+>
+>`core._get_current_isa_level()` is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purposes only and is subject to change.
+
+## CPU feature check
+
+An additional CPU feature check tool is provided in the subfolder `tests/cpu/isa`.
+
+```bash
+$ cmake .
+-- The C compiler identification is GNU 11.2.1 +-- The CXX compiler identification is GNU 11.2.1 +-- Detecting C compiler ABI info +-- Detecting C compiler ABI info - done +-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped +-- Detecting C compile features +-- Detecting C compile features - done +-- Detecting CXX compiler ABI info +-- Detecting CXX compiler ABI info - done +-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped +-- Detecting CXX compile features +-- Detecting CXX compile features - done +-- Configuring done +-- Generating done +-- Build files have been written to: tests/cpu/isa +$ make +[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o +[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o +[100%] Linking CXX executable cpu_features +[100%] Built target cpu_features +$ ./cpu_features +XCR0: 00000000000602e7 +os --> avx: true +os --> avx2: true +os --> avx512: true +os --> amx: true +mmx: true +sse: true +sse2: true +sse3: true +ssse3: true +sse4_1: true +sse4_2: true +aes_ni: true +sha: true +xsave: true +fma: true +f16c: true +avx: true +avx2: true +avx_vnni: true +avx512_f: true +avx512_cd: true +avx512_pf: false +avx512_er: false +avx512_vl: true +avx512_bw: true +avx512_dq: true +avx512_ifma: true +avx512_vbmi: true +avx512_vpopcntdq: true +avx512_4fmaps: false +avx512_4vnniw: false +avx512_vbmi2: true +avx512_vpclmul: true +avx512_vnni: true +avx512_bitalg: true +avx512_fp16: true +avx512_bf16: true +avx512_vp2intersect: true +amx_bf16: true +amx_tile: true +amx_int8: true +prefetchw: true +prefetchwt1: false +``` diff --git a/xpu/2.3.110+xpu/_sources/index.rst.txt b/xpu/2.3.110+xpu/_sources/index.rst.txt new file mode 100644 index 000000000..d2a30c291 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/index.rst.txt @@ -0,0 +1,90 @@ +.. meta:: + :description: This website introduces Intel® Extension for PyTorch* + :keywords: Intel optimization, PyTorch, Intel® Extension for PyTorch*, GPU, discrete GPU, Intel discrete GPU + +Intel® Extension for PyTorch* +############################# + +Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. +Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel X\ :sup:`e`\ Matrix Extensions (XMX) AI engines on Intel discrete GPUs. +Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* ``xpu`` device. + +In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain +Large Language Models (LLMs) are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the `Large Language Models (LLMs) <./tutorials/llm.html>`_ section. + +The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing ``intel_extension_for_pytorch``. + +.. note:: + - CPU features are not included in GPU-only packages. 
+ - GPU features are not included in CPU-only packages. + - Optimizations for CPU-only may have a newer code base due to different development schedules. + +Intel® Extension for PyTorch* has been released as an open–source project at `Github `_. You can find the source code and instructions on how to get started at: + +- **CPU**: `CPU main branch `_ | `Quick Start `_ +- **XPU**: `XPU main branch `_ | `Quick Start `_ + +You can find more information about the product at: + +- `Features `_ +- `Performance `_ + +Architecture +------------ + +Intel® Extension for PyTorch* is structured as shown in the following figure: + +.. figure:: ./images/intel_extension_for_pytorch_structure.png + :width: 800 + :align: center + :alt: Architecture of Intel® Extension for PyTorch* + + Architecture of Intel® Extension for PyTorch* + +- **Eager Mode**: In the eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Further performance improvement is achieved by converting eager-mode models into graph mode using extended graph fusion passes. +- **Graph Mode**: In the graph mode, fusions reduce operator/kernel invocation overhead, resulting in improved performance. Compared to the eager mode, the graph mode in PyTorch* normally yields better performance from the optimization techniques like operation fusion. Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Both PyTorch ``Torchscript`` and ``TorchDynamo`` graph modes are supported. With ``Torchscript``, we recommend using ``torch.jit.trace()`` as your preferred option, as it generally supports a wider range of workloads compared to ``torch.jit.script()``. With ``TorchDynamo``, ipex backend is available to provide good performances. +- **CPU Optimization**: On CPU, Intel® Extension for PyTorch* automatically dispatches operators to underlying kernels based on detected instruction set architecture (ISA). The extension leverages vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for increased efficiency. +- **GPU Optimization**: On GPU, optimized operators and kernels are implemented and registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel GPU hardware. Intel® Extension for PyTorch* for GPU utilizes the `DPC++ `_ compiler that supports the latest `SYCL* `_ standard and also a number of extensions to the SYCL* standard, which can be found in the `sycl/doc/extensions `_ directory. + + +Support +------- +The team tracks bugs and enhancement requests using `GitHub issues `_. Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported. + +.. toctree:: + :caption: ABOUT + :maxdepth: 3 + :hidden: + + tutorials/introduction + tutorials/features + Large Language Models (LLM) + tutorials/performance + tutorials/technical_details + tutorials/releases + tutorials/known_issues + Blogs & Publications + tutorials/license + +.. toctree:: + :maxdepth: 3 + :caption: GET STARTED + :hidden: + + Installation + tutorials/getting_started + tutorials/examples + +.. toctree:: + :maxdepth: 3 + :caption: DEVELOPER REFERENCE + :hidden: + + tutorials/api_doc + +.. 
toctree:: + :maxdepth: 3 + :caption: CONTRIBUTING GUIDE + :hidden: + + tutorials/contribution diff --git a/xpu/2.3.110+xpu/_sources/tutorials/api_doc.rst.txt b/xpu/2.3.110+xpu/_sources/tutorials/api_doc.rst.txt new file mode 100644 index 000000000..3d35dd00e --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/api_doc.rst.txt @@ -0,0 +1,54 @@ +API Documentation +################# + +General +======= + +.. currentmodule:: intel_extension_for_pytorch +.. autofunction:: optimize +.. currentmodule:: intel_extension_for_pytorch.llm +.. autofunction:: optimize +.. currentmodule:: intel_extension_for_pytorch +.. autofunction:: get_fp32_math_mode +.. autofunction:: set_fp32_math_mode + +Memory management +================= + +.. currentmodule:: intel_extension_for_pytorch.xpu +.. autofunction:: empty_cache +.. list_gpu_processes +.. mem_get_info +.. autofunction:: memory_stats +.. autofunction:: memory_summary +.. autofunction:: memory_snapshot +.. autofunction:: memory_allocated +.. autofunction:: max_memory_allocated +.. reset_max_memory_allocated +.. autofunction:: memory_reserved +.. autofunction:: max_memory_reserved +.. set_per_process_memory_fraction +.. memory_cached +.. max_memory_cached +.. reset_max_memory_cached +.. autofunction:: reset_peak_memory_stats +.. caching_allocator_alloc +.. caching_allocator_delete + +.. autofunction:: memory_stats_as_nested_dict +.. autofunction:: reset_accumulated_memory_stats + + +Quantization +============ + +.. currentmodule:: intel_extension_for_pytorch.quantization.fp8 +.. autofunction:: fp8_autocast + + +C++ API +======= + +.. doxygenenum:: torch_ipex::xpu::FP32_MATH_MODE + +.. doxygenfunction:: torch_ipex::xpu::set_fp32_math_mode diff --git a/xpu/2.3.110+xpu/_sources/tutorials/contribution.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/contribution.md.txt new file mode 100644 index 000000000..934f5165b --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/contribution.md.txt @@ -0,0 +1,129 @@ +Contribution +============ + +## Contributing to Intel® Extension for PyTorch\* + +Thank you for your interest in contributing to Intel® Extension for PyTorch\*. Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution: + +1. You want to propose a new feature and implement it. + - Post about your intended feature in a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues), and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it. +2. You want to implement a feature or bug-fix for an outstanding issue. + - Search for your issue in the [GitHub issue list](https://github.com/intel/intel-extension-for-pytorch/issues). + - Pick an issue and comment that you'd like to work on the feature or bug-fix. + - If you need more context on a particular issue, ask and we shall provide. + +Once you implement and test your feature or bug-fix, submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch. + +## Developing Intel® Extension for PyTorch\* on XPU + +A full set of instructions on installing Intel® Extension for PyTorch\* from source is in the [Installation document](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu). + +To develop on your machine, here are some tips: + +1. Uninstall all existing Intel® Extension for PyTorch\* installs. You may need to run `pip uninstall intel_extension_for_pytorch` multiple times. 
You'll know `intel_extension_for_pytorch` is fully uninstalled when you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`. (You should only have to `pip uninstall` a few times, but you can always `uninstall` with `timeout` or in a loop.) + + ```bash + yes | pip uninstall intel_extension_for_pytorch + ``` + +2. Clone a copy of Intel® Extension for PyTorch\* from source: + + ```bash + git clone https://github.com/intel/intel-extension-for-pytorch.git -b xpu-main + cd intel-extension-for-pytorch + ``` + + If you already have Intel® Extension for PyTorch\* from source, update it: + + ```bash + git pull --rebase + git submodule sync --recursive + git submodule update --init --recursive --jobs 0 + ``` + +3. Install Intel® Extension for PyTorch\* in `develop` mode: + + Replace: + + ```bash + python setup.py install + ``` + + with: + + ```bash + python setup.py develop + ``` + + This mode will symlink the Python files from the current local source tree into the Python install. After that, if you modify a Python file, you do not need to reinstall Intel® Extension for PyTorch\* again. This is especially useful if you are only changing Python files. + + For example: + - Install local Intel® Extension for PyTorch\* in `develop` mode + - modify your Python file `intel_extension_for_pytorch/__init__.py` (for example) + - test functionality + +You do not need to repeatedly install after modifying Python files (`.py`). However, you would need to reinstall if you modify a Python interface (`.pyi`, `.pyi.in`) or non-Python files (`.cpp`, `.h`, etc.). + +If you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch\* first by running `pip uninstall intel_extension_for_pytorch` until you see `WARNING: Skipping intel_extension_for_pytorch as it is not installed`. Then run `python setup.py clean`. After that, you can install in `develop` mode again. + +### Tips and Debugging + +* Our `setup.py` requires Python >= 3.6 +* If you run into errors when running `python setup.py develop`, here are some debugging steps: + 1. Remove your `build` directory. The `setup.py` script compiles binaries into the `build` folder and caches many details along the way. This saves time the next time you build. If you're running into issues, you can always `rm -rf build` from the toplevel directory and start over. + 2. If you have made edits to the Intel® Extension for PyTorch\* repo, commit any change you'd like to keep and clean the repo with the following commands (note that clean _really_ removes all untracked files and changes.): + ```bash + git submodule deinit -f . + git clean -xdf + python setup.py clean + git submodule update --init --recursive --jobs 0 # very important to sync the submodules + python setup.py develop # then try running the command again + ``` + 3. The main step within `python setup.py develop` is running `make` from the `build` directory. If you want to experiment with some environment variables, you can pass them into the command: + ```bash + ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop + ``` + +## Unit testing + +All Python test suites are located in the `tests/gpu` folder and start with `test_`. 
Run individual test suites using the command `python tests/gpu/${Sub_Folder}/FILENAME.py`, where `FILENAME` represents the file containing the test suite you wish to run and `${Sub_Folder}` is one of the following folders: +- examples: unit tests created during op development +- experimental: ported [test suites](https://github.com/pytorch/pytorch/tree/v1.10.0/test) from Stock PyTorch 1.10 +- regression: unit tests created during bug fix to avoid future regression + +### Better local unit tests with `pytest` + +We don't officially support `pytest`, but it works well with our unit tests and offers a number of useful features for local developing. Install it via `pip install pytest`. + +For more information about unit tests, please read [README.md](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/tests/gpu/README.md) in the `tests/gpu` folder. + +## Writing documentation + +Do you want to write some documentation for your code contribution and don't know where to start? + +Intel® Extension for PyTorch\* uses [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) for formatting docstrings. Length of line inside docstrings block must be limited to 80 characters to fit into Jupyter documentation popups. + +### Building documentation + +To build the documentation: + +1. Build and install Intel® Extension for PyTorch\* (as discussed above) + +2. Install the prerequisites: + + ```bash + cd docs + pip install -r requirements.txt + ``` + +3. Generate the documentation HTML files. The generated files will be in `docs/_build/html`. + + ```bash + make clean + make html + ``` + +#### Tips + +The `.rst` source files live in `docs/tutorials` folder. Some of the `.rst` files pull in docstrings from Intel® Extension for PyTorch\* Python code (for example, via the `autofunction` or `autoclass` directives). To shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base `index.rst` file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete. + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/examples.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/examples.md.txt new file mode 100644 index 000000000..1b5ce1f8f --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/examples.md.txt @@ -0,0 +1,982 @@ +Examples +======== + +These examples will help you get started using Intel® Extension for PyTorch\* +with Intel GPUs. + +**Prerequisites**: +Before running these examples, install the `torchvision` and `transformers` Python packages. + +- [Python](#python) examples demonstrate usage of Python APIs: + + - [Training](#training) + - [Inference](#inference) + +- [C++](#c) examples demonstrate usage of C++ APIs +- [Intel® AI Reference Models](#intel-ai-reference-models) provide out-of-the-box use cases, demonstrating the performance benefits achievable with Intel Extension for PyTorch\* + + +## Python + +### Training + +#### Single-Instance Training + +To use Intel® Extension for PyTorch\* on training, you need to make the following changes in your code: + +1. Import `intel_extension_for_pytorch` as `ipex`. +2. Use the `ipex.optimize` function for additional performance boost, which applies optimizations against the model object, as well as an optimizer object. +3. Use Auto Mixed Precision (AMP) with BFloat16 data type. +4. Convert input tensors, loss criterion and model to XPU, as shown below: + +``` +... +import torch +import intel_extension_for_pytorch as ipex +... 
+model = Model() +criterion = ... +optimizer = ... +model.train() +# Move model and loss criterion to xpu before calling ipex.optimize() +model = model.to("xpu") +criterion = criterion.to("xpu") + +# For Float32 +model, optimizer = ipex.optimize(model, optimizer=optimizer) +# For BFloat16 +model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16) +... +dataloader = ... +for (input, target) in dataloader: + input = input.to("xpu") + target = target.to("xpu") + optimizer.zero_grad() + # For Float32 + output = model(input) + + # For BFloat16 + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + output = model(input) + + loss = criterion(output, target) + loss.backward() + optimizer.step() +... +``` + +Below you can find complete code examples demonstrating how to use the extension on training for different data types: + +##### Float32 + +[//]: # (marker_train_single_fp32_complete) +```python +import torch +import torchvision + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +LR = 0.001 +DOWNLOAD = True +DATA = "datasets/cifar10/" + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128) + +model = torchvision.models.resnet50() +criterion = torch.nn.CrossEntropyLoss() +optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9) +model.train() +######################## code changes ####################### +model = model.to("xpu") +criterion = criterion.to("xpu") +model, optimizer = ipex.optimize(model, optimizer=optimizer) +######################## code changes ####################### + +for batch_idx, (data, target) in enumerate(train_loader): + ########## code changes ########## + data = data.to("xpu") + target = target.to("xpu") + ########## code changes ########## + optimizer.zero_grad() + output = model(data) + loss = criterion(output, target) + loss.backward() + optimizer.step() + print(batch_idx) +torch.save( + { + "model_state_dict": model.state_dict(), + "optimizer_state_dict": optimizer.state_dict(), + }, + "checkpoint.pth", +) + +print("Execution finished") +``` +[//]: # (marker_train_single_fp32_complete) + +##### BFloat16 + +[//]: # (marker_train_single_bf16_complete) +```python +import torch +import torchvision + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +LR = 0.001 +DOWNLOAD = True +DATA = "datasets/cifar10/" + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128) + +model = torchvision.models.resnet50() +criterion = torch.nn.CrossEntropyLoss() +optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9) +model.train() +##################################### code changes ################################ +model = model.to("xpu") +criterion = 
criterion.to("xpu") +model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16) +##################################### code changes ################################ + +for batch_idx, (data, target) in enumerate(train_loader): + optimizer.zero_grad() + ######################### code changes ######################### + data = data.to("xpu") + target = target.to("xpu") + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + ######################### code changes ######################### + output = model(data) + loss = criterion(output, target) + loss.backward() + optimizer.step() + print(batch_idx) +torch.save( + { + "model_state_dict": model.state_dict(), + "optimizer_state_dict": optimizer.state_dict(), + }, + "checkpoint.pth", +) + +print("Execution finished") +``` +[//]: # (marker_train_single_bf16_complete) + +### Inference + +Get additional performance boosts for your computer vision and NLP workloads by +applying the Intel® Extension for PyTorch\* `optimize` function against your +model object. + +#### Float32 + +##### Imperative Mode + +###### Resnet50 + +[//]: # (marker_inf_rn50_imp_fp32) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +######## code changes ####### +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model) +######## code changes ####### + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_fp32) + +###### BERT + +[//]: # (marker_inf_bert_imp_fp32) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +######## code changes ####### +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model) +######## code changes ####### + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_imp_fp32) + +##### TorchScript Mode + +We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. 
+ +###### Resnet50 + +[//]: # (marker_inf_rn50_ts_fp32) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +######## code changes ####### +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model) +######## code changes ####### + +with torch.no_grad(): + d = torch.rand(1, 3, 224, 224) + ##### code changes ##### + d = d.to("xpu") + ##### code changes ##### + model = torch.jit.trace(model, d) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_ts_fp32) + +###### BERT + +[//]: # (marker_inf_bert_ts_fp32) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +######## code changes ####### +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model) +######## code changes ####### + +with torch.no_grad(): + d = torch.randint(vocab_size, size=[batch_size, seq_length]) + ##### code changes ##### + d = d.to("xpu") + ##### code changes ##### + model = torch.jit.trace(model, (d,), strict=False) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_ts_fp32) + +#### BFloat16 + +The `optimize` function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the `dtype` parameter to `torch.bfloat16`. +We recommend using Auto Mixed Precision (AMP) with BFloat16 data type. 
+ + +##### Imperative Mode + +###### Resnet50 + +[//]: # (marker_inf_rn50_imp_bf16) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +#################### code changes ################# +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.bfloat16) +#################### code changes ################# + +with torch.no_grad(): + ############################# code changes ##################### + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + ############################ code changes ###################### + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_bf16) + +###### BERT + +[//]: # (marker_inf_bert_imp_bf16) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes ################# +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.bfloat16) +#################### code changes ################# + +with torch.no_grad(): + ########################### code changes ######################## + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + ########################### code changes ######################## + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_imp_bf16) + +##### TorchScript Mode + +We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. 
+ +###### Resnet50 + +[//]: # (marker_inf_rn50_ts_bf16) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +#################### code changes ################# +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.bfloat16) +#################### code changes ################# + +with torch.no_grad(): + d = torch.rand(1, 3, 224, 224) + ############################# code changes ##################### + d = d.to("xpu") + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + ############################# code changes ##################### + model = torch.jit.trace(model, d) + model = torch.jit.freeze(model) + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_ts_bf16) + +###### BERT + +[//]: # (marker_inf_bert_ts_bf16) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes ################# +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.bfloat16) +#################### code changes ################# + +with torch.no_grad(): + d = torch.randint(vocab_size, size=[batch_size, seq_length]) + ############################# code changes ##################### + d = d.to("xpu") + with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16): + ############################# code changes ##################### + model = torch.jit.trace(model, (d,), strict=False) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_ts_bf16) + +#### Float16 + +The `optimize` function works for both Float32 and Float16 data type. For Float16 data type, set the `dtype` parameter to `torch.float16`. +We recommend using Auto Mixed Precision (AMP) with Float16 data type. 
+ +##### Imperative Mode + +###### Resnet50 + +[//]: # (marker_inf_rn50_imp_fp16) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +#################### code changes ################ +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.float16) +#################### code changes ################ + +with torch.no_grad(): + ############################# code changes ##################### + with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16): + ############################# code changes ##################### + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_fp16) + +###### BERT + +[//]: # (marker_inf_bert_imp_fp16) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes ################ +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.float16) +#################### code changes ################ + +with torch.no_grad(): + ############################# code changes ##################### + with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16): + ############################# code changes ##################### + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_imp_fp16) + +##### TorchScript Mode + +We recommend using Intel® Extension for PyTorch\* with [TorchScript](https://pytorch.org/docs/stable/jit.html) for further optimizations. 
+ +###### Resnet50 + +[//]: # (marker_inf_rn50_ts_fp16) +```python +import torch +import torchvision.models as models + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +#################### code changes ################ +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.float16) +#################### code changes ################ + +with torch.no_grad(): + d = torch.rand(1, 3, 224, 224) + ############################# code changes ##################### + d = d.to("xpu") + with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16): + ############################# code changes ##################### + model = torch.jit.trace(model, d) + model = torch.jit.freeze(model) + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_ts_fp16) + +###### BERT + +[//]: # (marker_inf_bert_ts_fp16) +```python +import torch +from transformers import BertModel + +############# code changes ############### +import intel_extension_for_pytorch as ipex + +############# code changes ############### + +model = BertModel.from_pretrained("bert-base-uncased") +model.eval() + +vocab_size = model.config.vocab_size +batch_size = 1 +seq_length = 512 +data = torch.randint(vocab_size, size=[batch_size, seq_length]) + +#################### code changes ################ +model = model.to("xpu") +data = data.to("xpu") +model = ipex.optimize(model, dtype=torch.float16) +#################### code changes ################ + +with torch.no_grad(): + d = torch.randint(vocab_size, size=[batch_size, seq_length]) + ############################# code changes ##################### + d = d.to("xpu") + with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16): + ############################# code changes ##################### + model = torch.jit.trace(model, (d,), strict=False) + model = torch.jit.freeze(model) + + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_bert_ts_fp16) + +#### INT8 + +We recommend using TorchScript for INT8 model because it has wider support for models. TorchScript mode also auto-enables our optimizations. For TorchScript INT8 model, inserting observer and model quantization is achieved through `prepare_jit` and `convert_jit` separately. Calibration process is required for collecting statistics from real data. After conversion, optimizations such as operator fusion would be auto-enabled. 
+ +[//]: # (marker_int8_static) +```python +import os +import torch +from torch.jit._recursive import wrap_cpp_module +from torch.quantization.quantize_jit import ( + convert_jit, + prepare_jit, +) + +#################### code changes #################### +import intel_extension_for_pytorch as ipex + +###################################################### + +##### Example Model ##### +import torchvision.models as models + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +model = model.to("xpu") + +with torch.no_grad(): + data = torch.rand(1, 3, 224, 224) + data = data.to("xpu") + modelJit = torch.jit.trace(model, data) +######################### + +qconfig = torch.quantization.QConfig( + activation=torch.quantization.observer.MinMaxObserver.with_args( + qscheme=torch.per_tensor_symmetric, reduce_range=False, dtype=torch.quint8 + ), + weight=torch.quantization.default_weight_observer, +) +modelJit = prepare_jit(modelJit, {"": qconfig}, True) + +##### Example Dataloader ##### +import torchvision + +DOWNLOAD = True +DATA = "datasets/cifar10/" + +transform = torchvision.transforms.Compose( + [ + torchvision.transforms.Resize((224, 224)), + torchvision.transforms.ToTensor(), + torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)), + ] +) +train_dataset = torchvision.datasets.CIFAR10( + root=DATA, + train=True, + transform=transform, + download=DOWNLOAD, +) +calibration_data_loader = torch.utils.data.DataLoader( + dataset=train_dataset, batch_size=128 +) + +for batch_idx, (d, target) in enumerate(calibration_data_loader): + print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}") + d = d.to("xpu") + modelJit(d) +############################## + +modelJit = convert_jit(modelJit, True) + +data = torch.rand(1, 3, 224, 224) +data = data.to("xpu") +modelJit(data) + +print("Execution finished") +``` +[//]: # (marker_int8_static) + +#### torch.xpu.optimize + +The `torch.xpu.optimize` function is an alternative to `ipex.optimize` in Intel® Extension for PyTorch\*, and provides identical usage for XPU devices only. The motivation for adding this alias is to unify the coding style in user scripts base on `torch.xpu` modular. Refer to the example below for usage. + +[//]: # (marker_inf_rn50_imp_fp32_alt) +```python +import torch +import torchvision.models as models + +############# code changes ######### +import intel_extension_for_pytorch + +############# code changes ######### + +model = models.resnet50(weights="ResNet50_Weights.DEFAULT") +model.eval() +data = torch.rand(1, 3, 224, 224) + +model = model.to(memory_format=torch.channels_last) +data = data.to(memory_format=torch.channels_last) + +########## code changes ######### +model = model.to("xpu") +data = data.to("xpu") +model = torch.xpu.optimize(model) +########## code changes ######### + +with torch.no_grad(): + model(data) + +print("Execution finished") +``` +[//]: # (marker_inf_rn50_imp_fp32_alt) + +## C++ + +To work with libtorch, the PyTorch C++ library, Intel® Extension for PyTorch\* provides its own C++ dynamic library. The C++ library only handles inference workloads, such as service deployment. For regular development, use the Python interface. Unlike using libtorch, no specific code changes are required. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in the [PyTorch tutorial](https://pytorch.org/tutorials/advanced/cpp_export.html#depending-on-libtorch-and-building-the-application). 
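+
+The C++ samples in this section load a TorchScript module that is passed on the command line. A minimal sketch of producing such a file from Python is shown below; the model choice (ResNet-50) and the file name `resnet50_jit.pt` are only illustrative, and the module is traced on CPU here because the C++ example moves the loaded module to the XPU device itself.
+
+```python
+# Sketch: serialize a TorchScript module for the C++ examples below to load.
+import torch
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+
+with torch.no_grad():
+    traced = torch.jit.trace(model, torch.rand(1, 3, 224, 224))
+
+# Pass the saved file as argv[1] to example-app; the C++ code moves it to XPU.
+torch.jit.save(traced, "resnet50_jit.pt")
+```
+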
+
+During compilation, Intel optimizations will be activated automatically after the C++ dynamic library of Intel® Extension for PyTorch\* is linked.
+
+The example code below works for all data types.
+
+### Basic Usage
+
+**Download and Install cppsdk**
+
+Make sure you have downloaded and installed the cppsdk from the [installation page](https://intel.github.io/intel-extension-for-pytorch/index.html#installation) before compiling the C++ code.
+
+1. Go to the [installation page](https://intel.github.io/intel-extension-for-pytorch/index.html#installation)
+2. Select the desired Platform, Version, and OS
+3. In the package section, select cppsdk
+4. Follow the instructions on the cppsdk installation page to download and install cppsdk into libtorch.
+
+**example-app.cpp**
+
+[//]: # (marker_cppsdk_sample_app)
+```cpp
+#include <torch/script.h>
+#include <iostream>
+#include <memory>
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  }
+  catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+  module.to(at::kXPU);
+
+  std::vector<torch::jit::IValue> inputs;
+  torch::Tensor input = torch::rand({1, 3, 224, 224}).to(at::kXPU);
+  inputs.push_back(input);
+
+  at::Tensor output = module.forward(inputs).toTensor();
+  std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+```
+[//]: # (marker_cppsdk_sample_app)
+
+**CMakeLists.txt**
+
+[//]: # (marker_cppsdk_cmake_app)
+```cmake
+cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-app)
+
+find_package(IPEX REQUIRED)
+
+set(target example-app)
+add_executable(${target} example-app.cpp)
+target_link_libraries(${target} ${TORCH_IPEX_LIBRARIES})
+
+set_property(TARGET ${target} PROPERTY CXX_STANDARD 17)
+```
+[//]: # (marker_cppsdk_cmake_app)
+
+**Command for compilation**
+
+```bash
+$ cd examples/gpu/inference/cpp/example-app
+$ mkdir build
+$ cd build
+$ CC=icx CXX=icpx cmake -DCMAKE_PREFIX_PATH=<LIBTORCH_PATH> ..
+$ make
+```
+
+`<LIBTORCH_PATH>` is the absolute path of the libtorch we installed in the first step.
+
+If *Found IPEX* is shown with dynamic library paths, the extension was linked into the binary. This can be verified with the Linux command *ldd*.
+
+The values of x, y, z in the following log will change depending on the version you choose.
+
+```bash
+$ CC=icx CXX=icpx cmake -DCMAKE_PREFIX_PATH=/workspace/libtorch ..
+-- The C compiler identification is IntelLLVM 202x.y.z
+-- The CXX compiler identification is IntelLLVM 202x.y.z
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /workspace/intel/oneapi/compiler/202x.y.z/linux/bin/icx - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /workspace/intel/oneapi/compiler/202x.y.z/linux/bin/icpx - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+-- Looking for pthread.h
+-- Looking for pthread.h - found
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
+-- Found Threads: TRUE
+-- Found Torch: /workspace/libtorch/lib/libtorch.so
+-- Found IPEX: /workspace/libtorch/lib/libintel-ext-pt-cpu.so;/workspace/libtorch/lib/libintel-ext-pt-gpu.so
+-- Configuring done
+-- Generating done
+-- Build files have been written to: examples/gpu/inference/cpp/example-app/build
+
+$ ldd example-app
+    ...
+    libtorch.so => /workspace/libtorch/lib/libtorch.so (0x00007fd5bb927000)
+    libc10.so => /workspace/libtorch/lib/libc10.so (0x00007fd5bb895000)
+    libtorch_cpu.so => /workspace/libtorch/lib/libtorch_cpu.so (0x00007fd5a44d8000)
+    libintel-ext-pt-cpu.so => /workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007fd5a1a1b000)
+    libintel-ext-pt-gpu.so => /workspace/libtorch/lib/libintel-ext-pt-gpu.so (0x00007fd5862b0000)
+    ...
+    libmkl_intel_lp64.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_intel_lp64.so.2 (0x00007fd584ab0000)
+    libmkl_core.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_core.so.2 (0x00007fd5806cc000)
+    libmkl_gnu_thread.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_gnu_thread.so.2 (0x00007fd57eb1d000)
+    libmkl_sycl.so.3 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_sycl.so.3 (0x00007fd55512c000)
+    libOpenCL.so.1 => /workspace/intel/oneapi/compiler/202x.y.z/linux/lib/libOpenCL.so.1 (0x00007fd55511d000)
+    libsvml.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libsvml.so (0x00007fd553b11000)
+    libirng.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libirng.so (0x00007fd553600000)
+    libimf.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libimf.so (0x00007fd55321b000)
+    libintlc.so.5 => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007fd553a9c000)
+    libsycl.so.6 => /workspace/intel/oneapi/compiler/202x.y.z/linux/lib/libsycl.so.6 (0x00007fd552f36000)
+    ...
+```
+
+### Use SYCL code
+
+Using SYCL code in a C++ application is also possible. The example below shows how to invoke SYCL code. You need to explicitly pass `-fsycl` into `CMAKE_CXX_FLAGS`.
+
+**example-usm.cpp**
+
+[//]: # (marker_cppsdk_sample_usm)
+```cpp
+#include <torch/script.h>
+#include <c10/xpu/XPUStream.h>
+#include <sycl/sycl.hpp>
+#include <iostream>
+#include <memory>
+#include <vector>
+
+using namespace sycl;
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  }
+  catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+  std::cout << "load model done " << std::endl;
+  module.to(at::kXPU);
+
+  std::vector<torch::jit::IValue> inputs;
+  c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+  auto options = at::TensorOptions().dtype(at::kFloat).device(stream.device());
+  // allocate device USM through the SYCL queue that backs the current XPU stream
+  float *input_ptr = malloc_device<float>(224 * 224 * 3, stream.queue());
+  auto input = torch::from_blob(
+      input_ptr,
+      {1, 3, 224, 224},
+      options);
+  std::cout << "input tensor created from usm " << std::endl;
+  inputs.push_back(input);
+
+  at::IValue output = module.forward(inputs);
+  torch::Tensor output_tensor;
+  output_tensor = output.toTensor();
+  std::cout << output_tensor.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+```
+[//]: # (marker_cppsdk_sample_usm)
+
+**CMakeLists.txt**
+
+[//]: # (marker_cppsdk_cmake_usm)
+```cmake
+cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-usm)
+
+find_package(IPEX REQUIRED)
+
+set(target example-usm)
+add_executable(${target} example-usm.cpp)
+target_link_libraries(${target} ${TORCH_IPEX_LIBRARIES})
+list(APPEND CMAKE_CXX_FLAGS "-fsycl")
+
+set_property(TARGET ${target} PROPERTY CXX_STANDARD 17)
+```
+[//]: # (marker_cppsdk_cmake_usm)
+
+### Customize DPC++ kernels
+
+Intel® Extension for PyTorch\* provides its C++ dynamic library to allow users to implement custom DPC++ kernels to run on the XPU device.
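+
+For example, from Python the SYCL queue that backs the current XPU stream can be obtained as an opaque pointer and handed to such a kernel, as sketched below (the same `sycl_queue` API is described in the DPC++ Extension feature section later in this document; `my_dpcpp_ext` is a hypothetical compiled extension module).
+
+```python
+# Sketch: obtain the SYCL queue behind the current XPU stream for a custom kernel.
+import torch
+import intel_extension_for_pytorch  # noqa: F401
+
+stream = torch.xpu.current_stream()
+queue_ptr = stream.sycl_queue  # opaque void* that C++ code can cast to a sycl::queue*
+
+# A hypothetical compiled extension could then submit its kernel to this queue:
+# my_dpcpp_ext.run_kernel(queue_ptr, some_tensor)
+```
+
+This is only a sketch of how the pieces connect.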
Refer to the [DPC++ extension](./features/DPC++_Extension.md) for details. + + +## Intel® AI Reference Models + +Use cases that have already been optimized by Intel engineers are available at [Intel® AI Reference Models](https://github.com/IntelAI/models/tree/v3.1.1) (former Model Zoo). A number of PyTorch use cases for benchmarking are also available in the [Use Cases](https://github.com/IntelAI/models/tree/v3.1.1?tab=readme-ov-file#use-cases) section. Models verified on Intel GPUs are marked in the `Model Documentation` column. You can get performance benefits out-of-the-box by simply running scripts in the Intel® AI Reference Models. diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features.rst.txt b/xpu/2.3.110+xpu/_sources/tutorials/features.rst.txt new file mode 100644 index 000000000..597629675 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features.rst.txt @@ -0,0 +1,183 @@ +Features +======== + +Easy-to-use Python API +---------------------- + +Intel® Extension for PyTorch\* provides simple frontend Python APIs and utilities to get performance optimizations such as operator optimization. + +Check the `API Documentation `_ for API functions description and `Examples `_ for usage guidance. + +Channels Last +------------- + +Compared with the default NCHW memory format, using channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch\*, NHWC memory format has been enabled for most key CPU and GPU operators. More detailed information is available at `Channels Last `_. + +Intel® Extension for PyTorch* automatically converts a model to channels last memory format when users optimize the model with ``ipex.optimize(model)``. With this feature, users do not need to manually apply ``model=model.to(memory_format=torch.channels_last)`` anymore. However, models running on Intel® Data Center GPU Flex Series will choose oneDNN layout, so users still need to manually convert the model and data to channels last format. More detailed information is available at `Auto Channels Last `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/nhwc + features/auto_channels_last + + +Auto Mixed Precision (AMP) +-------------------------- + +Benefiting from less memory usage and computation, low precision data types typically speed up both training and inference workloads. +On GPU side, support of BFloat16 and Float16 are both available in Intel® Extension for PyTorch\*. BFloat16 is the default low precision floating data type when AMP is enabled. + +Detailed information of AMP for GPU are available at `Auto Mixed Precision (AMP) on GPU `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/amp_gpu + + +Quantization +------------ + +Intel® Extension for PyTorch* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs. + +Check more detailed information for `INT8 Quantization `_. + +On Intel® GPUs, Intel® Extension for PyTorch* also provides FP8 Quantization. Check more detailed information for `FP8 Quantization <./features/float8.md>`_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/int8_overview_xpu + features/float8 + + +Distributed Training +-------------------- + +To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs is supported. Two alternative methodologies are available. 
Users can choose either to use PyTorch native distributed training module, `Distributed Data Parallel (DDP) `_, with `Intel® oneAPI Collective Communications Library (oneCCL) `_ support via `Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) `_ or use Horovod with `Intel® oneAPI Collective Communications Library (oneCCL) `_ support (Prototype). + +For more detailed information, check `DDP `_ and `Horovod (Prototype) `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/DDP + features/horovod + + +DLPack Solution +--------------- + +DLPack defines a stable in-memory data structure for sharing tensors among frameworks. It enables sharing of tensor data without copying when interoparating with other libraries. Intel® Extension for PyTorch* extends DLPack support in PyTorch* for XPU device particularly. + +For more detailed information, check `DLPack Solution `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/DLPack + +DPC++ Extension +--------------- + +Intel® Extension for PyTorch\* provides C++ APIs to get SYCL queue and configure floating-point math mode. + +Check the `API Documentation`_ for the details of API functions. `DPC++ Extension `_ describes how to write customized DPC++ kernels with a practical example and build it with setuptools and CMake. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/DPC++_Extension + +Advanced Configuration +---------------------- + +The default settings for Intel® Extension for PyTorch* are sufficient for most use cases. However, if you need to customize Intel® Extension for PyTorch*, advanced configuration is available at build time and runtime. + +For more detailed information, check `Advanced Configuration `_. + +A driver environment variable `ZE_FLAT_DEVICE_HIERARCHY` is currently used to select the device hierarchy model with which the underlying hardware is exposed. By default, each GPU tile is used as a device. Check the `Level Zero Specification Documentation `_ for more details. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/advanced_configuration + +Fully Sharded Data Parallel (FSDP) +---------------------------------- + +`Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provides industry-grade solution for large model training. FSDP is a type of data parallel training, unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. + +For more detailed information, check `FSDP `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/FSDP + +torch.compile for GPU (Beta) +---------------------------- + +Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship `torch.compile `_ API through the default "inductor" backend (`TorchInductor `_ ). + +For more detailed information, check `torch.compile for GPU `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/torch_compile_gpu + +Kineto Supported Profiler Tool (Prototype) +------------------------------------------ + +The Kineto supported profiler tool is an extension of PyTorch\* profiler for profiling operators' executing time cost on GPU devices. With this tool, you can get information in many fields of the run models or code scripts. 
Build Intel® Extension for PyTorch\* with Kineto support as default and enable this tool using the `with` statement before the code segment. + +For more detailed information, check `Profiler Kineto `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/profiler_kineto + + +Compute Engine (Prototype feature for debug) +-------------------------------------------- + +Compute engine is a prototype feature which provides the capacity to choose specific backend for operators with multiple implementations. + +For more detailed information, check `Compute Engine `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/compute_engine + + +``IPEX_LOG`` (Prototype feature for debug) +------------------------------------------ + + +``IPEX_LOG`` provides the capability to log verbose information from Intel® Extension for PyTorch\* . Please use ``IPEX_LOG`` to get the log information or trace the execution from Intel® Extension for PyTorch\*. Please continue using PyTorch\* macros such as ``TORCH_CHECK``, ``TORCH_ERROR``, etc. to get the log information from PyTorch\*. + +For more detailed information, check `IPEX_LOG `_. + +.. toctree:: + :hidden: + :maxdepth: 1 + + features/ipex_log + + + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/DDP.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/DDP.md.txt new file mode 100644 index 000000000..2ec2872e3 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/DDP.md.txt @@ -0,0 +1,238 @@ +DistributedDataParallel (DDP) +============================= + +## Introduction + +`DistributedDataParallel (DDP)` is a PyTorch\* module that implements multi-process data parallelism across multiple GPUs and machines. With DDP, the model is replicated on every process, and each model replica is fed a different set of input data samples. Please refer to [DDP Tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for an introduction to DDP. + +The PyTorch `Collective Communication (c10d)` library supports communication across processes. To run DDP on GPU, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as `allreduce`, `allgather`, and `alltoall`. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL. + +## Installation of Intel® oneCCL Bindings for Pytorch\* + +To use PyTorch DDP on GPU, install Intel® oneCCL Bindings for Pytorch\* as described below. + +### Install PyTorch and Intel® Extension for PyTorch\* + +Make sure you have installed PyTorch and Intel® Extension for PyTorch\* successfully. +For more detailed information, check [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu). + +### Install Intel® oneCCL Bindings for Pytorch\* + +#### [Recommended] Install from prebuilt wheels + + +1. Install `oneccl_bindings_for_pytorch` + +``` +# Generic Python* for CPU +REPO_URL: https://pytorch-extension.intel.com/release-whl/stable/cpu/us/ +# Generic Python* for GPU +REPO_URL: https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +``` + +Installation from either repository shares the command below. Replace the place holder `` with a real URL mentioned above. 
+
+```bash
+python -m pip install oneccl_bind_pt --extra-index-url <REPO_URL>
+```
+
+#### Install from source
+
+Refer to the [Installation Guide](https://github.com/intel/torch-ccl/tree/ccl_torch2.1.300+xpu?tab=readme-ov-file#install-from-source) to install Intel® oneCCL Bindings for Pytorch\* from source.
+
+### Runtime Dynamic Linking
+
+- Dynamically link oneCCL from the oneAPI basekit:
+
+```bash
+source <ONEAPI_ROOT>/ccl/latest/env/vars.sh
+source <ONEAPI_ROOT>/mpi/latest/env/vars.sh
+```
+
+Note: Make sure you have installed the [basekit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit) when using Intel® oneCCL Bindings for Pytorch\* on Intel® GPUs. If the basekit is installed with a package manager, `<ONEAPI_ROOT>` is `/opt/intel/oneapi`.
+
+## DDP Usage
+
+DDP follows its usage in PyTorch. To use DDP with Intel® Extension for PyTorch\*, make the following modifications to your model script:
+
+1. Import the necessary packages.
+```python
+import torch
+import intel_extension_for_pytorch
+import oneccl_bindings_for_pytorch
+```
+2. Initialize the process group with the ccl backend.
+```python
+dist.init_process_group(backend='ccl')
+```
+3. For DDP where each process exclusively works on a single GPU, set the device ID to the `local rank`. This step is not required for usage on CPU.
+```python
+device = "xpu:{}".format(args.local_rank)
+torch.xpu.set_device(device)
+```
+4. Wrap the model with DDP.
+```python
+model = model.to(device)
+model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])
+```
+
+Note: For single-device modules, `device_ids` can contain exactly one device id, which represents the only GPU device where the input module corresponding to this process resides. Alternatively, `device_ids` can be `None`.
+
+Note: When using `torch.xpu.optimize` for distributed training with low precision, `torch.xpu.manual_seed(seed_number)` is needed to make sure the master weight is the same on all ranks.
+
+## Example Usage (MPI launch for single node)
+
+Intel® oneCCL Bindings for Pytorch\* recommends MPI as the launcher to start multiple processes. Here's an example to illustrate such usage.
+
+Dynamically link oneCCL and Intel MPI libraries:
+
+```bash
+source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
+# Or
+source <ONEAPI_ROOT>/ccl/latest/env/vars.sh
+source <ONEAPI_ROOT>/mpi/latest/env/vars.sh
+```
+
+`Example_DDP.py`
+
+```python
+"""
+This example shows how to use MPI as the launcher to start DDP on a single node with multiple devices.
+""" +import os +import torch +import torch.nn as nn +from torch.nn.parallel import DistributedDataParallel as DDP +import torch.distributed as dist +import intel_extension_for_pytorch +import oneccl_bindings_for_pytorch + + +class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + self.linear = nn.Linear(4, 5) + + def forward(self, input): + return self.linear(input) + + +if __name__ == "__main__": + + torch.xpu.manual_seed(123) # set a seed number + mpi_world_size = int(os.environ.get('PMI_SIZE', -1)) + mpi_rank = int(os.environ.get('PMI_RANK', -1)) + if mpi_world_size > 0: + os.environ['RANK'] = str(mpi_rank) + os.environ['WORLD_SIZE'] = str(mpi_world_size) + else: + # set the default rank and world size to 0 and 1 + os.environ['RANK'] = str(os.environ.get('RANK', 0)) + os.environ['WORLD_SIZE'] = str(os.environ.get('WORLD_SIZE', 1)) + os.environ['MASTER_ADDR'] = '127.0.0.1' # your master address + os.environ['MASTER_PORT'] = '29500' # your master port + + # Initialize the process group with ccl backend + dist.init_process_group(backend='ccl') + + # For single-node distributed training, local_rank is the same as global rank + local_rank = dist.get_rank() + # Only set device for distributed training on GPU + device = "xpu:{}".format(local_rank) + model = Model().to(device) + if dist.get_world_size() > 1: + model = DDP(model, device_ids=[device]) + + optimizer = torch.optim.SGD(model.parameters(), lr=0.001) + loss_fn = nn.MSELoss().to(device) + for i in range(3): + print("Runing Iteration: {} on device {}".format(i, device)) + input = torch.randn(2, 4).to(device) + labels = torch.randn(2, 5).to(device) + # forward + print("Runing forward: {} on device {}".format(i, device)) + res = model(input) + # loss + print("Runing loss: {} on device {}".format(i, device)) + L = loss_fn(res, labels) + # backward + print("Runing backward: {} on device {}".format(i, device)) + L.backward() + # update + print("Runing optim: {} on device {}".format(i, device)) + optimizer.step() +``` + +Running command: + +```bash +mpirun -n 2 -l python Example_DDP.py +``` + +## DDP scaling API (GPU Only) + +For using one GPU card with multiple tiles, each tile could be regarded as a device for explicit scaling. We provide a DDP scaling API to enable DDP on one GPU card in [GitHub repo](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/intel_extension_for_pytorch/xpu/single_card.py). + +### Usage of DDP scaling API + +Note: This API supports GPU devices on one card. + +```python +Args: +model: model to be parallelized +train_dataset: dataset for training +``` + +If you have a model running on a single tile, you only need to make minor changes to enable the DDP training by following these steps: + +1. Import the API: + +```python +try: + from intel_extension_for_pytorch.xpu.single_card import single_card_dist +except ImportError: + raise ImportError("single_card_dist not available!") +``` + +2. Use multi_process_spawn launcher as a torch.multiprocessing wrapper. + +```python +single_card_dist.multi_process_spawn(main_worker, (args, )) # put arguments of main_worker into a tuple +``` + +3. Usage of this API: + +```python +dist = single_card_dist(model, train_dataset) +local_rank, model, train_sampler = dist.rank, dist.model, dist.train_sampler +``` + +4. Set in the model training: + +```python +for epoch in range ... + train_sampler.set_epoch(epoch) +``` + +5. 
Adjust the model to call `local_rank`, `model`, and `train_sampler` as shown here:
+
+- device: get the XPU device used in model training
+
+```python
+xpu = "xpu:{}".format(local_rank)
+print("DDP Use XPU: {} for training".format(xpu))
+```
+
+- model: use the model wrapped by DDP in the following training
+
+- train_sampler: use the train_sampler to get the train_loader
+
+```python
+train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
+                                           num_workers=args.workers, pin_memory=True, sampler=train_sampler)
+```
+
+Then you can start your model training on multiple GPU devices of one card.
+
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/DLPack.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/DLPack.md.txt
new file mode 100644
index 000000000..4d41d272b
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/tutorials/features/DLPack.md.txt
@@ -0,0 +1,72 @@
+DLPack Solution
+===============
+
+## Introduction
+
+[DLPack](https://dmlc.github.io/dlpack/latest/) defines a stable in-memory data structure for sharing tensors among frameworks. It is a solution with wide community adoption and supports NumPy, PyTorch, and other popular frameworks in the deep learning domain. Intel® Extension for PyTorch\* extends DLPack support in PyTorch for the XPU backend in particular, in order to share tensor data without a copy when interoperating with other libraries via the DLPack solution. The currently supported DLPack version is [v0.7](https://github.com/dmlc/dlpack/releases/tag/v0.7).
+
+## Use Case
+
+The following use case demonstrates two typical DLPack usages related to Intel® Extension for PyTorch\*. One is to import an external tensor into Intel® Extension for PyTorch\*. The tensor from an external library is packed in a DLPack capsule, then converted to a PyTorch tensor on XPU, to be operable in Intel® Extension for PyTorch\*. The other is to export a PyTorch tensor on XPU to an external library. The PyTorch tensor on XPU is packed in a DLPack capsule, so that the external library can operate on this shared tensor via the DLPack solution.
+
+```python
+import intel_extension_for_pytorch
+import torch.utils.dlpack
+
+# create DLPack capsule from external
+capsule = ...
+
+# Usage 1: convert DLPack capsule to PyTorch tensor on XPU
+t = torch.from_dlpack(capsule)
+
+# create PyTorch tensor on XPU
+t2 = torch.empty([10], device='xpu')
+
+# Usage 2: convert PyTorch tensor on XPU to DLPack capsule
+capsule2 = torch.to_dlpack(t2)
+
+```
+
+## Design
+
+When importing an external tensor which is in `DLManagedTensor` format, a PyTorch tensor is created, and the required information such as `dim`, `sizes`, and `strides` is parsed and extracted from the external tensor into the PyTorch tensor by stock PyTorch. The `data_ptr` points to the original memory allocation and a data copy is not required. Here Intel® Extension for PyTorch\* is responsible for converting the device type and id from `DLDevice` to the ATen device for the XPU backend.
+
+### Import DLPack Capsule
+
+![fig-1-DLPack-import](../../images/DLPack/figure1_DLPack_import.svg)
+
+When exporting a PyTorch tensor, an `ATenDLMTensor` is created, with its `handle` pointing to the original PyTorch tensor and its `tensor` containing the exported tensor in `DLManagedTensor` format. The required information such as `ndim`, `shape`, and `strides` is parsed and extracted from the PyTorch tensor into the external tensor. The `data pointer` points to the original memory allocation and a data copy is not required. Here Intel® Extension for PyTorch\* is responsible for converting the device type and id from the ATen device to `DLDevice` for the XPU backend.
+ +### Export DLPack Capsule + +![fig-2-DLPack-import](../../images/DLPack/figure2_DLPack_export.svg) + +Note: The used `DLManagedTensor` format in above figures is from https://dmlc.github.io/dlpack/latest/python_spec.html. + +### `DLDevice` and `data pointer` + +`DLDeviceType` in `DLDevice` is `kDLOneAPI` for sharing memory between Intel® Extension for PyTorch\* and other libraries. It is not `kDLSycl` since it relies on oneAPI SYCL extensions filter_selector and default platform context to operate. The `device_id` in `DLDevice` is one of the SYCL runtime device ids, which may be different from the actual framework device in use. When producing a DLPack capsule, DPC++ runtime will get the device where memory allocation was original made. If the device has parent device, we will find its parent device index enumerated in `sycl::device::get_devices()` then put to `device_id`. + +`data pointer` points to the shared data via DLPack to be accessed by consumer. Only USM allocations are valid in `data pointer` when `DLDeviceType` is `kDLOneAPI`. [SYCL 2020 Specification](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html) defines three types of USM allocations: `sycl::usm::host`, `sycl::usm::device`, and `sycl::usm::shared`. `sycl::usm::device` is the only supported type. Also the USM allocations in `sycl::usm::device` are valid in DLPack only when the memory allocation was made under [default SYCL context](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_default_context.asciidoc) per SYCL platform. + +## Asynchronous Programming + +So far, DLPack defines how the producer shares memory allocations in DLPack capsule format and how consumer recognizes the shared memory allocations. It does not define the synchronization method between producer and consumer so that both sides know when it is safe to access the data in shared memory allocations. Under the situation that the producer and the consumer probably have different implementation for supporting asynchronous programming, it is hard to define a general solution for various scenarios. It is up to consumer to monitor the execution flow of Intel® Extension for PyTorch\* and find out when the data is ready to use. + +The following example shows one possible solution for the consumer to safely use USM allocations from Intel® Extension for PyTorch\*. + +### Example Case + +```python +import intel_extension_for_pytorch +import torch.utils.dlpack + +# Get shared tensor from Intel® Extension for PyTorch* via DLPack +t = torch.from_dlpack(capsule) + +# Wait for the data ready to use +torch.xpu.synchronize() + +# Use the data in shared tensor +... +``` diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/DPC++_Extension.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/DPC++_Extension.md.txt new file mode 100644 index 000000000..c98f0faed --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/DPC++_Extension.md.txt @@ -0,0 +1,495 @@ +# DPC++ Extension + +## Introduction + +C++ extension is a mechanism developed by PyTorch that lets you to create customized and highly efficient PyTorch operators defined out-of-source, i.e. separate from the PyTorch backend. (For more details, see https://pytorch.org/tutorials/advanced/cpp_extension.html). Based on the PyTorch C++ extension mechanism, Intel® Extension for PyTorch\* lets you to create PyTorch operators with custom DPC++ kernels to run on the XPU device. + +**Note:** The current implementation of the DPC++ extension only supports Linux. 
+ +## Motivation and Example + +This tutorial walks through a practical example of writing and using a DPC++ extension on the XPU device with Intel® Extension for PyTorch\*. + +## Writing a DPC++ Extension + +DPC++ extensions come in two flavors: They can be built “ahead of time” (AOT) with `setuptools`, or “just in time” (JIT) via `torch.xpu.cpp_extension.load()`. We’ll begin with the first approach and discuss the latter one afterwards. + +Besides, DPC++ extension also supports compilation with `CMake`. We’ll discuss the CMake methodology at last. + +### Building with setuptools + +For building with `setuptools`, we build our DPC++ extension by writing a `setup.py` script that uses `setuptools` to compile our C++ code. For the Long-Long-Term-Memory unit (LLTM), it looks like this: +```python +from setuptools import setup +import torch +import intel_extension_for_pytorch +from torch.xpu.cpp_extension import DPCPPExtension, DpcppBuildExtension + +setup( + name='lltm', + ext_modules=[ + DPCPPExtension('lltm_xpu', [ + 'lltm_xpu.cpp', + 'lltm_xpu_kernel.cpp', + ]) + ], + cmdclass={ + 'build_ext': DpcppBuildExtension + }) +``` +In this code, `DPCPPExtension` is a convenience wrapper around `setuptools.Extension` that passes the correct include paths and sets the language of the extension to C++. The equivalent vanilla `setuptools` code would simply be: +```python +Extension( + name='lltm_xpu', + sources=['lltm_xpu.cpp', 'lltm_xpu_kernel.cpp',], + include_dirs=cpp_extension.include_paths(), + language='c++') +``` +`DpcppBuildExtension` performs a number of required configuration steps and checks and also manages compilation in the case of DPC++ extensions. And that’s all we really need to know about building DPC++ extensions for now. + + Let’s take a look at the implementation of our DPC++ extension, which goes into `lltm_xpu.cpp` and `lltm_xpu_kernel.cpp`. +After building the Python module with DPC++ extension, the `lltm_xpu` is available for importing as an extension plug-in. +```python +import lltm_xpu +``` + + +### JIT Compiling Extensions + +Previously, we mentioned that there were two ways of building DPC++ extensions: use setuptools as AOT or compile with JIT. Having the former one introduced, let’s elaborate on the latter one. The JIT compilation mechanism provides a methodology to compile and load your extensions on the fly by invoking a simple `torch` API function `torch.xpu.cpp_extension.load()`. For the LLTM, this would look as simple as this: + +```python +import torch +import intel_extension_for_pytorch +from torch.xpu.cpp_extension import load + +lltm_xpu = load(name="lltm_xpu", sources=['lltm_xpu.cpp', 'lltm_xpu_kernel.cpp',]) +``` +Here, we provide a function with the same information as those for `setuptools`. In the background, the function will do the followings: +1. Create a temporary directory `/tmp/torch_extensions/py[ver]_xpu/lltm_xpu`, +2. Emit a `Ninja` build file into that temporary directory, +3. Compile your source files into a shared library, +4. Import this shared library as a Python module. + +In fact, if you pass `verbose=True` to `cpp_extension.load()`, you will be informed about the process: +``` +Emitting ninja build file /home/[user_name]/.cache/torch_extensions/py[ver]_xpu/lltm_xpu/build.ninja... +Building extension module lltm_xpu... +Loading extension module lltm_xpu... +``` +The resulting Python module are exactly the same as the ones produced by `setuptools`. This avoids maintaining a separate `setup.py` build file. 
Generally this JIT technique will do the compilation just fine; however, if your setup is more complicated and you do need the full power of `setuptools`, you can still write your own `setup.py`. The first run through this line takes some time, as the extension is compiled in the background. Since we use the Ninja build system to build the source files, re-compilation is incremental, so reloading the extension on subsequent runs of your Python module is fast and has low overhead if the extension's source files have not changed.
+
+### Building with CMake
+
+For building with `CMake`, we build our DPC++ extension by writing a `CMakeLists.txt` file that uses CMake to build our C++ code. For the same example we showed using `setuptools`, the `CMakeLists.txt` looks like this:
+
+CMakeLists.txt
+```cmake
+cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
+project(lltm_xpu)
+
+find_package(Python COMPONENTS Interpreter Development)
+find_package(Torch REQUIRED)
+find_package(IPEX REQUIRED)
+
+# The SYCL kernel should be compiled with "-fsycl"
+set_source_files_properties(lltm_xpu_kernel.cpp PROPERTIES COMPILE_FLAGS "-fsycl")
+
+add_library(lltm_xpu SHARED lltm_xpu.cpp lltm_xpu_kernel.cpp)
+target_link_libraries(lltm_xpu "${TORCH_LIBRARIES}")
+target_link_libraries(lltm_xpu "${TORCH_IPEX_LIBRARIES}")
+target_include_directories(lltm_xpu PUBLIC "${Python_INCLUDE_DIRS}")
+target_include_directories(lltm_xpu PUBLIC "${TORCH_IPEX_INCLUDE_DIRS}")
+
+# DPC++ needs C++17
+set_property(TARGET lltm_xpu PROPERTY CXX_STANDARD 17)
+```
+
+Find the cmake_prefix_path of torch and ipex:
+```
+$ python
+>>> import torch
+>>> import intel_extension_for_pytorch
+>>> torch.utils.cmake_prefix_path
+'<torch_cmake_prefix_path>'
+>>> intel_extension_for_pytorch.cmake_prefix_path
+'<ipex_cmake_prefix_path>'
+```
+
+Commands for compilation:
+```
+$ cmake -DCMAKE_PREFIX_PATH="<torch_cmake_prefix_path>;<ipex_cmake_prefix_path>" -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ..
+$ make
+```
+
+After building the Python module with CMake, `lltm_xpu` is also available for importing as an extension plug-in, just like with the setuptools method.
+```
+$ python
+>>> import torch
+>>> import intel_extension_for_pytorch
+>>> import lltm_xpu
+```
+
+### Requesting the current c10::xpu::XPUStream
+
+If you need to get the current `c10::xpu::XPUStream` on the current XPU device to do synchronization, you can implement it as below.
+```cpp
+c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+stream.synchronize();
+```
+
+### Fetching the corresponding sycl::queue
+
+We provide some APIs to fetch the corresponding `sycl::queue` associated with the
+current `c10::xpu::XPUStream`.
+In C++ code, you can fetch a `sycl::queue` reference as below.
+```cpp
+c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+auto& queue = stream.queue();
+```
+In Python code, you can use the code below to get a `void*`, which can be cast to a `sycl::queue` pointer.
+```python
+import torch
+import intel_extension_for_pytorch
+stream = torch.xpu.current_stream()
+queue = stream.sycl_queue # queue is a ``void*`` which can be cast to a sycl::queue pointer
+```
+Subsequently, you can submit a customized kernel via the `sycl::queue` by yourself. Refer to [Writing the DPC++ Op](#writing-the-dpc-op) for more details.
+
+### Writing the DPC++ Op
+
+The general strategy for writing a DPC++ extension is to write a C++ file that defines the functions that are called from Python, and binds those functions to Python with pybind11. The C++ functions do some checks and ultimately forward the calls to submit SYCL kernels.
The `ipex.cpp_extension` package then takes care of compiling the C++ sources with a DPC++ compiler. + +Let's consider the PyTorch CUDA examples https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-mixed-c-cuda-extension. Here is how we implement it in DPC++ style: +```cpp +#include + +#include + +// XPU forward declarations + +std::vector lltm_xpu_forward( + torch::Tensor input, + torch::Tensor weights, + torch::Tensor bias, + torch::Tensor old_h, + torch::Tensor old_cell); + +std::vector lltm_xpu_backward( + torch::Tensor grad_h, + torch::Tensor grad_cell, + torch::Tensor new_cell, + torch::Tensor input_gate, + torch::Tensor output_gate, + torch::Tensor candidate_cell, + torch::Tensor X, + torch::Tensor gate_weights, + torch::Tensor weights); + +// C++ interface + +#define CHECK_XPU(x) TORCH_CHECK(x.device().is_xpu(), #x " must be a XPU tensor") +#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous") +#define CHECK_INPUT(x) CHECK_XPU(x); CHECK_CONTIGUOUS(x) + +std::vector lltm_forward( + torch::Tensor input, + torch::Tensor weights, + torch::Tensor bias, + torch::Tensor old_h, + torch::Tensor old_cell) { + CHECK_INPUT(input); + CHECK_INPUT(weights); + CHECK_INPUT(bias); + CHECK_INPUT(old_h); + CHECK_INPUT(old_cell); + + return lltm_xpu_forward(input, weights, bias, old_h, old_cell); +} + +std::vector lltm_backward( + torch::Tensor grad_h, + torch::Tensor grad_cell, + torch::Tensor new_cell, + torch::Tensor input_gate, + torch::Tensor output_gate, + torch::Tensor candidate_cell, + torch::Tensor X, + torch::Tensor gate_weights, + torch::Tensor weights) { + CHECK_INPUT(grad_h); + CHECK_INPUT(grad_cell); + CHECK_INPUT(input_gate); + CHECK_INPUT(output_gate); + CHECK_INPUT(candidate_cell); + CHECK_INPUT(X); + CHECK_INPUT(gate_weights); + CHECK_INPUT(weights); + + return lltm_xpu_backward( + grad_h, + grad_cell, + new_cell, + input_gate, + output_gate, + candidate_cell, + X, + gate_weights, + weights); +} + +PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { + m.def("forward", &lltm_forward, "LLTM forward (XPU)"); + m.def("backward", &lltm_backward, "LLTM backward (XPU)"); +} +``` + +The bridge code checks and forwards the calls to functions that we’ll define in the DPC++ code file `lltm_xpu_kernel.cpp`. DPC++ supports compiling C++ naturally, thus we still have ATen and the C++ standard library available to us. + +Let’s go through the DPC++ code step by step: + +```cpp +#include + +#include + +#include + +template +scalar_t sigmoid(scalar_t z) { + return 1.0f / (1.0f + exp(-z)); +} +``` + +At the beginning of the code, we include `` that will introduce all the torch definitions into the code. After that, the `` line includes the SYCL header in DPC++. With the `` and ``, all the essential declarations have been included for writing the DPC++ kernel to run on the XPU device. The helper function `sigmoid` does the math calculation with the more efficient C++ language. Next are some more helper functions for LLTM: + +```cpp +template +scalar_t d_sigmoid(scalar_t z) { + const auto s = sigmoid(z); + return (1.0f - s) * s; +} + +template +scalar_t d_tanh(scalar_t z) { + const auto t = tanh(z); + return 1.0f - (t * t); +} + +template +scalar_t elu(scalar_t z, scalar_t alpha = 1.0f) { + return fmax(0.0f, z) + fmin(0.0f, alpha * (exp(z) - 1.0f)); +} + +template +scalar_t d_elu(scalar_t z, scalar_t alpha = 1.0f) { + const auto e = exp(z); + const auto d_relu = z < 0.0f ? 0.0f : 1.0f; + return d_relu + (((alpha * (e - 1.0f)) < 0.0f) ? 
(alpha * e) : 0.0f); +} +``` + +Now we can implement the actual code for our extension with two functions in DPC++: +* a function that performs operations we don’t wish to explicitly write by hand and calls into the function to submit the SYCL kernel, +* a function that actual submits the SYCL kernel to the XPU device for the parts we want to speed up. + +For the forward pass, the first function looks like this: + +```cpp +std::vector lltm_xpu_forward( + torch::Tensor input, + torch::Tensor weights, + torch::Tensor bias, + torch::Tensor old_h, + torch::Tensor old_cell) { + auto X = torch::cat({old_h, input}, /*dim=*/1); + auto gates = torch::addmm(bias, X, weights.transpose(0, 1)); + + const auto batch_size = old_cell.size(0); + const auto state_size = old_cell.size(1); + + auto new_h = torch::zeros_like(old_cell); + auto new_cell = torch::zeros_like(old_cell); + auto input_gate = torch::zeros_like(old_cell); + auto output_gate = torch::zeros_like(old_cell); + auto candidate_cell = torch::zeros_like(old_cell); + + AT_DISPATCH_FLOATING_TYPES(gates.type(), "lltm_forward_xpu", ([&] { + lltm_xpu_forward_kernel( + gates.data(), + old_cell.data(), + new_h.data(), + new_cell.data(), + input_gate.data(), + output_gate.data(), + candidate_cell.data(), + state_size, + batch_size); + })); + + return {new_h, new_cell, input_gate, output_gate, candidate_cell, X, gates}; +} +``` + +The purpose of `AT_DISPATCH_FLOATING_TYPES` is to take care of this dispatch for us. It takes a type (`gates.type()` in our case), a name (for error messages) and a lambda function. Inside this lambda function, the type alias `scalar_t` is available and is defined as the type that the tensor actually is at runtime in that context. As such, if we have a template function (which will submit the actual SYCL kernel), we can instantiate it with this `scalar_t` alias, and the correct function will be called. In this case, we also want to retrieve the data pointers of the tensors as pointers of that `scalar_t` type. If you wanted to dispatch over all types and not just floating point types (`Float` and `Double`), you can use `AT_DISPATCH_ALL_TYPES`. 
+ +Here's how to submit the actual kernel to the XPU device: + +```cpp +template +void lltm_xpu_forward_kernel( + const scalar_t* gates, + const scalar_t* old_cell, + scalar_t* new_h, + scalar_t* new_cell, + scalar_t* input_gate, + scalar_t* output_gate, + scalar_t* candidate_cell, + size_t state_size, + size_t batch_size) { + + const int threads = 1024; + const int work_groups = (state_size + threads - 1) / threads; + + // define the kernel + auto cgf = [&](sycl::handler& cgh) { + auto kfn = [=](sycl::nd_item<2> item) { + + const int column = item.get_group(0) * item.get_group_range(0) + item.get_local_id(0); + const int index = item.get_group(1) * state_size + column; + const int gates_row = item.get_group(1) * (state_size * 3); + + if (column < state_size) { + input_gate[index] = sigmoid(gates[gates_row + column]); + output_gate[index] = sigmoid(gates[gates_row + state_size + column]); + candidate_cell[index] = elu(gates[gates_row + 2 * state_size + column]); + new_cell[index] = + old_cell[index] + candidate_cell[index] * input_gate[index]; + new_h[index] = tanh(new_cell[index]) * output_gate[index]; + } + + }; + + cgh.parallel_for( + sycl::nd_range<2>( + sycl::range<2>(work_groups * threads, batch_size), + sycl::range<2>(threads, 1)), + kfn); + }; + + // submit kernel + c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream(); + stream.queue().submit(cgf); +} +``` + +We're specifying that each work group has 1024 threads and that the entire GPU grid is split into as many work groups of 1 x 1024 threads as are required to fill our matrices with one thread per component. For example, if our state size was 2048 and our batch size 4, we’d launch a total of 4 x 2 = 8 work groups with 1024 threads each. If you are not familiar with the SYCL “work groups”, an introductory read about SYCL may help. + +Note that the `c10::impl::VirtualGuardImpl` must get the current stream of the current XPU device and use the XPU API to get the corresponding SYCL underlaying queue. It can then submit the kernel to the queue for execution. + +#### Using accessors + +You can see in the SYCL kernel that we work directly on pointers with the right type. Indeed, working directly with high level type agnostic tensors inside SYCL kernels would be very inefficient. + +However, this comes at a cost of ease of use and readability, especially for highly dimensional data. We can use torch's C++ utils to abstract access to high dimension data in the SYCL kernel directly. + +The backwards pass follows much the same pattern but with the `torch::PackedTensorAccessor32`. 
You can get more information about these utils in torch documents: + +```cpp +template +void lltm_xpu_backward_kernel( + torch::PackedTensorAccessor32 d_old_cell, + torch::PackedTensorAccessor32 d_gates, + const torch::PackedTensorAccessor32 grad_h, + const torch::PackedTensorAccessor32 grad_cell, + const torch::PackedTensorAccessor32 new_cell, + const torch::PackedTensorAccessor32 input_gate, + const torch::PackedTensorAccessor32 output_gate, + const torch::PackedTensorAccessor32 candidate_cell, + const torch::PackedTensorAccessor32 gate_weights, + size_t state_size, + size_t batch_size) { + + const int threads = 1024; + const int work_groups = (state_size + threads - 1) / threads; + + // define the kernel + auto cgf = [&](sycl::handler& cgh) { + auto kfn = [=](sycl::nd_item<2> item) { + //batch index + const int n = item.get_group(1); + // column index + const int c = item.get_group(0) * item.get_group_range(0) + item.get_local_id(0); + auto d_gates_ = d_gates; + auto d_old_cell_ = d_old_cell; + if (c < d_gates.size(2)){ + const auto d_output_gate = tanh(new_cell[n][c]) * grad_h[n][c]; + const auto d_tanh_new_cell = output_gate[n][c] * grad_h[n][c]; + const auto d_new_cell = + d_tanh(new_cell[n][c]) * d_tanh_new_cell + grad_cell[n][c]; + + + d_old_cell_[n][c] = d_new_cell; + const auto d_candidate_cell = input_gate[n][c] * d_new_cell; + const auto d_input_gate = candidate_cell[n][c] * d_new_cell; + + d_gates_[n][0][c] = + d_input_gate * d_sigmoid(gate_weights[n][0][c]); + d_gates_[n][1][c] = + d_output_gate * d_sigmoid(gate_weights[n][1][c]); + d_gates_[n][2][c] = + d_candidate_cell * d_elu(gate_weights[n][2][c]); + } + }; + + cgh.parallel_for( + sycl::nd_range<2>( + sycl::range<2>(work_groups * threads, batch_size), + sycl::range<2>(threads, 1)), + kfn); + }; + + // submit kernel + c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream(); + stream.queue().submit(cgf); +} + +std::vector lltm_xpu_backward( + torch::Tensor grad_h, + torch::Tensor grad_cell, + torch::Tensor new_cell, + torch::Tensor input_gate, + torch::Tensor output_gate, + torch::Tensor candidate_cell, + torch::Tensor X, + torch::Tensor gates, + torch::Tensor weights) { + auto d_old_cell = torch::zeros_like(new_cell); + auto d_gates = torch::zeros_like(gates); + + const auto batch_size = new_cell.size(0); + const auto state_size = new_cell.size(1); + + AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_backward_xpu", ([&] { + lltm_xpu_backward_kernel( + d_old_cell.packed_accessor32(), + d_gates.packed_accessor32(), + grad_h.packed_accessor32(), + grad_cell.packed_accessor32(), + new_cell.packed_accessor32(), + input_gate.packed_accessor32(), + output_gate.packed_accessor32(), + candidate_cell.packed_accessor32(), + gates.packed_accessor32(), + state_size, + batch_size); + })); + + auto d_gate_weights = d_gates.reshape({batch_size, 3*state_size}); + auto d_weights = d_gate_weights.t().mm(X); + auto d_bias = d_gate_weights.sum(/*dim=*/0, /*keepdim=*/true); + + auto d_X = d_gate_weights.mm(weights); + auto d_old_h = d_X.slice(/*dim=*/1, 0, state_size); + auto d_input = d_X.slice(/*dim=*/1, state_size); + + return {d_old_h, d_input, d_weights, d_bias, d_old_cell, d_gates}; +} +``` diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/FSDP.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/FSDP.md.txt new file mode 100644 index 000000000..9d2b9376f --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/FSDP.md.txt @@ -0,0 +1,300 @@ +Fully Sharded Data Parallel (FSDP) +============================= + +## Introduction 
+ +`Fully Sharded Data Parallel (FSDP)` is a PyTorch\* module that provides industry-grade solution for large model training. FSDP is a type of data parallel training, unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. Please refer to [FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) for an introduction to FSDP. + +To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch\* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library\* (oneCCL), a library for efficient distributed deep learning training implementing collectives such as `AllGather`, `ReduceScatter`, and other needed by FSDP. Refer to [oneCCL Github page](https://github.com/oneapi-src/oneCCL) for more information about oneCCL. +To install Intel® oneCCL Bindings for Pytorch\*, follow the same installation steps as for DDP. + +## FSDP Usage (GPU only) + +FSDP is designed to align with PyTorch conventions. To use FSDP with Intel® Extension for PyTorch\*, make the following modifications to your model script: + +1. Import the necessary packages. +```python +import torch +import intel_extension_for_pytorch +import oneccl_bindings_for_pytorch +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +``` + +2. Initialize the process group with ccl backend. +```python +dist.init_process_group(backend='ccl') +``` + +3. For FSDP with each process exclusively working on a single GPU, set the device ID as `local rank`. +```python +torch.xpu.set_device("xpu:{}".format(rank)) +# or +device = "xpu:{}".format(args.local_rank) +torch.xpu.set_device(device) +``` + +4. Wrap model by FSDP. +```python +model = model.to(device) +model = FSDP(model, device_id=device) +``` + +**Note**: for FSDP with XPU, you need to specify `device_ids` with XPU device; otherwise, it will trigger the CUDA path and throw an error. + +## Example + +Here's an example based on [PyTorch FSDP Tutorial](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) to illustrate the usage of FSDP on XPU and the necessary changes to switch from CUDA to an XPU case. + +1. Import necessary packages: + +```python +""" +Import Intel® extension for Pytorch\* and Intel® oneCCL Bindings for Pytorch\* +""" +import os +import argparse +import functools +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.optim as optim + +# Import Intel® extension for Pytorch\* and Intel® oneCCL Bindings for Pytorch\* +import intel_extension_for_pytorch +import oneccl_bindings_for_pytorch + +from torchvision import datasets, transforms +from torch.optim.lr_scheduler import StepLR + +import torch.distributed as dist +import torch.multiprocessing as mp +from torch.nn.parallel import DistributedDataParallel as DDP +from torch.utils.data.distributed import DistributedSampler +from torch.distributed.fsdp import FullyShardedDataParallel as FSDP +from torch.distributed.fsdp.fully_sharded_data_parallel import ( + CPUOffload, + BackwardPrefetch, +) +from torch.distributed.fsdp.wrap import ( + size_based_auto_wrap_policy, + enable_wrap, + wrap, +) +``` + +2. 
Set up distributed training: + +```python +""" +Set the initialize the process group backend as Intel® oneCCL Bindings for Pytorch\* +""" +def setup(rank, world_size): + os.environ['MASTER_ADDR'] = 'localhost' + os.environ['MASTER_PORT'] = '12355' + + # initialize the process group by Intel® oneCCL Bindings for Pytorch\* + dist.init_process_group("ccl", rank=rank, world_size=world_size) + +def cleanup(): + dist.destroy_process_group() +``` + +3. Define the toy model for handwritten digit classification: + +```python +class Net(nn.Module): + def __init__(self): + super(Net, self).__init__() + self.conv1 = nn.Conv2d(1, 32, 3, 1) + self.conv2 = nn.Conv2d(32, 64, 3, 1) + self.dropout1 = nn.Dropout(0.25) + self.dropout2 = nn.Dropout(0.5) + self.fc1 = nn.Linear(9216, 128) + self.fc2 = nn.Linear(128, 10) + + def forward(self, x): + + x = self.conv1(x) + x = F.relu(x) + x = self.conv2(x) + x = F.relu(x) + x = F.max_pool2d(x, 2) + x = self.dropout1(x) + x = torch.flatten(x, 1) + x = self.fc1(x) + x = F.relu(x) + x = self.dropout2(x) + x = self.fc2(x) + output = F.log_softmax(x, dim=1) + return output +``` + +4. Define a training function: + +```python +""" +Change the device related logic from 'rank' to '"xpu:{}".format(rank)' +""" +def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=None): + model.train() + # XPU device should be formatted as string, replace the rank with '"xpu:{}".format(rank)' + ddp_loss = torch.zeros(2).to("xpu:{}".format(rank)) + if sampler: + sampler.set_epoch(epoch) + for batch_idx, (data, target) in enumerate(train_loader): + data, target = data.to("xpu:{}".format(rank)), target.to("xpu:{}".format(rank)) + optimizer.zero_grad() + output = model(data) + loss = F.nll_loss(output, target, reduction='sum') + loss.backward() + optimizer.step() + ddp_loss[0] += loss.item() + ddp_loss[1] += len(data) + + dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM) + if rank == 0: + print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, ddp_loss[0] / ddp_loss[1])) +``` + +5. Define a validation function: + +```python +""" +Change the device related logic from 'rank' to '"xpu:{}".format(rank)' +""" +def test(model, rank, world_size, test_loader): + model.eval() + correct = 0 + # XPU device should be formatted as string, replace the rank with '"xpu:{}".format(rank)' + ddp_loss = torch.zeros(3).to("xpu:{}".format(rank)) + with torch.no_grad(): + for data, target in test_loader: + data, target = data.to("xpu:{}".format(rank)), target.to("xpu:{}".format(rank)) + output = model(data) + ddp_loss[0] += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss + pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability + ddp_loss[1] += pred.eq(target.view_as(pred)).sum().item() + ddp_loss[2] += len(data) + + dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM) + + if rank == 0: + test_loss = ddp_loss[0] / ddp_loss[2] + print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format( + test_loss, int(ddp_loss[1]), int(ddp_loss[2]), + 100. * ddp_loss[1] / ddp_loss[2])) +``` + +6. Define a distributed training function that wraps the model in FSDP: + +```python +""" +Change the device related logic from 'rank' to '"xpu:{}".format(rank)'. +Specify the argument `device_ids` as XPU device ("xpu:{}".format(rank)) in FSDP API. 
+""" +def fsdp_main(rank, world_size, args): + setup(rank, world_size) + + transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.1307,), (0.3081,)) + ]) + + dataset1 = datasets.MNIST('../data', train=True, download=True, + transform=transform) + dataset2 = datasets.MNIST('../data', train=False, + transform=transform) + + sampler1 = DistributedSampler(dataset1, rank=rank, num_replicas=world_size, shuffle=True) + sampler2 = DistributedSampler(dataset2, rank=rank, num_replicas=world_size) + + train_kwargs = {'batch_size': args.batch_size, 'sampler': sampler1} + test_kwargs = {'batch_size': args.test_batch_size, 'sampler': sampler2} + xpu_kwargs = {'num_workers': 2, + 'pin_memory': True, + 'shuffle': False} + train_kwargs.update(xpu_kwargs) + test_kwargs.update(xpu_kwargs) + + train_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs) + test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs) + my_auto_wrap_policy = functools.partial( + size_based_auto_wrap_policy, min_num_params=100 + ) + torch.xpu.set_device("xpu:{}".format(rank)) + + + init_start_event = torch.xpu.Event(enable_timing=True) + init_end_event = torch.xpu.Event(enable_timing=True) + + model = Net().to("xpu:{}".format(rank)) + # Specify the argument `device_ids` as XPU device ("xpu:{}".format(rank)) in FSDP API. + model = FSDP(model, device_id="xpu:{}".format(rank)) + + optimizer = optim.Adadelta(model.parameters(), lr=args.lr) + + scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma) + init_start_event.record() + for epoch in range(1, args.epochs + 1): + train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1) + test(model, rank, world_size, test_loader) + scheduler.step() + + init_end_event.record() + + if rank == 0: + print(f"XPU event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec") + print(f"{model}") + + if args.save_model: + # use a barrier to make sure training is done on all ranks + dist.barrier() + states = model.state_dict() + if rank == 0: + torch.save(states, "mnist_cnn.pt") + + cleanup() +``` + +7. Finally, parse the arguments and set the main function: + +```python +""" +Replace CUDA runtime API with XPU runtime API. +""" +if __name__ == '__main__': + # Training settings + parser = argparse.ArgumentParser(description='PyTorch MNIST Example') + parser.add_argument('--batch-size', type=int, default=64, metavar='N', + help='input batch size for training (default: 64)') + parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N', + help='input batch size for testing (default: 1000)') + parser.add_argument('--epochs', type=int, default=10, metavar='N', + help='number of epochs to train (default: 14)') + parser.add_argument('--lr', type=float, default=1.0, metavar='LR', + help='learning rate (default: 1.0)') + parser.add_argument('--gamma', type=float, default=0.7, metavar='M', + help='Learning rate step gamma (default: 0.7)') + parser.add_argument('--no-cuda', action='store_true', default=False, + help='disables CUDA training') + parser.add_argument('--seed', type=int, default=1, metavar='S', + help='random seed (default: 1)') + parser.add_argument('--save-model', action='store_true', default=False, + help='For Saving the current Model') + args = parser.parse_args() + + torch.manual_seed(args.seed) + + WORLD_SIZE = torch.xpu.device_count() + mp.spawn(fsdp_main, + args=(WORLD_SIZE, args), + nprocs=WORLD_SIZE, + join=True) +``` + +8. 
Put the above code snippets to a python script `FSDP_mnist_xpu.py`, and run: + +```bash +python FSDP_mnist_xpu.py +``` + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/advanced_configuration.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/advanced_configuration.md.txt new file mode 100644 index 000000000..124ab0e5c --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/advanced_configuration.md.txt @@ -0,0 +1,86 @@ +Advanced Configuration +====================== + +The default settings for Intel® Extension for PyTorch\* are sufficient for most use cases. However, if users want to customize Intel® Extension for PyTorch\*, advanced configuration is available at build time and runtime. + +## Build Time Configuration + +The following build options are supported by Intel® Extension for PyTorch\*. Users who install Intel® Extension for PyTorch\* via source compilation could override the default configuration by explicitly setting a build option ON or OFF, and then build. + +| **Build Option** | **Default
Value** | **Description** | +| ------ | ------ | ------ | +| USE_ONEMKL | ON | Use oneMKL BLAS | +| USE_CHANNELS_LAST_1D | ON | Use channels last 1d | +| USE_PERSIST_STREAM | ON | Use persistent oneDNN stream | +| USE_SCRATCHPAD_MODE | ON | Use oneDNN scratchpad mode | +| USE_PRIMITIVE_CACHE | ON | Cache oneDNN primitives by FRAMEWORK for specific operators | +| USE_QUEUE_BARRIER | ON | Use queue submit_barrier, otherwise use dummy kernel | +| USE_PTI | ON | Build XPU Profiler with PTI support. | +| USE_DS_KERNELS | ON | Build deepspeed kernels | +| USE_SYCL_ASSERT | OFF | Enables assert in sycl kernel | +| USE_ITT_ANNOTATION | OFF | Enables ITT annotation in sycl kernel | +| USE_SPLIT_FP64_LOOPS | ON | Split FP64 loops into separate kernel for element-wise kernels | +| BUILD_BY_PER_KERNEL | OFF | Build by DPC++ per_kernel option (exclusive with USE_AOT_DEVLIST) | +| BUILD_INTERNAL_DEBUG | OFF | Use internal debug code path | +| BUILD_SEPARATE_OPS | OFF | Build each operator in separate library | +| BUILD_SIMPLE_TRACE | ON | Build simple trace for each registered operator | +| USE_AOT_DEVLIST | "" | Set device list for AOT build | +| USE_XETLA | "ON" | Use XeTLA based customer kernels; Specify a comma-sep list of gpu architectures (e.g. xe_lpg,xe_hpg) to only enable kernels for specific platforms | +| USE_ONEDNN_DIR | "" | Specify oneDNN source path which contains its include directory and lib directory | +| USE_XETLA_SRC | "${IPEX_GPU_ROOT_DIR}/aten/operators/xetla/kernels/" | Specify XETLA source path which contains its include dir | +| BUILD_OPT_LEVEL | "" | Add build option -Ox, accept values: 0/1 | +| BUILD_WITH_SANITIZER | "" | Build with sanitizer check. Support one of address, thread, and leak options at a time. The default option is address. | + +For above build options which can be configured to ON or OFF, users can configure them to 1 or 0 also, while ON equals to 1 and OFF equals to 0. + +## Runtime Configuration + +The following launch options are supported in Intel® Extension for PyTorch\*. Users who execute AI models on XPU could override the default configuration by explicitly setting the option value at runtime using environment variables, and then launch the execution. + +| **Launch Option
CPU, GPU** | **Default
Value** | **Description** | +| ------ | ------ | ------ | +| IPEX_FP32_MATH_MODE | FP32 | Set values for FP32 math mode (valid values: FP32, TF32, BF32). Refer to API Documentation for details. | + +| **Launch Option
GPU ONLY** | **Default
Value** | **Description** |
+| ------ | ------ | ------ |
+| IPEX_VERBOSE | 0 | Set the verbose level with synchronous execution mode. This option will be deprecated very soon; please use IPEX_LOG_LEVEL instead. |
+| IPEX_XPU_SYNC_MODE | 0 | Set to 1 to enforce synchronous execution mode. This option will be deprecated very soon. |
+| IPEX_LOG_LEVEL | -1 | Set the log level to trace the execution and get log information. Refer to 'ipex_log.md' for the available log levels. |
+| IPEX_LOG_COMPONENT | "ALL" | Set IPEX_LOG_COMPONENT=ALL to log messages from all components. Use ';' as the separator to log more than one component, such as "OPS;RUNTIME". Use '/' as the separator to log subcomponents. |
+| IPEX_LOG_ROTATE_SIZE | -1 | Set the rotate file size in MB for IPEX_LOG. A value less than 0 disables this setting. |
+| IPEX_LOG_SPLIT_SIZE | -1 | Set the split file size in MB for IPEX_LOG. A value less than 0 disables this setting. |
+| IPEX_LOG_OUTPUT | "" | Set the output file path for IPEX_LOG. The default is empty, which logs to the console. |
+
+| **Launch Option
Experimental** | **Default
Value** | **Description** |
+| ------ | ------ | ------ |
+| IPEX_SIMPLE_TRACE | 0 | Set to 1 to enable simple trace for all operators\*. This option will be deprecated very soon; please use IPEX_LOG_LEVEL instead. |
+
+| **Distributed Option
GPU ONLY** | **Default
Value** | **Description** |
+| ------ | ------ | ------ |
+| TORCH_LLM_ALLREDUCE | 0 | This is a prototype feature that provides better scale-up performance by enabling optimized collective algorithms in oneCCL and asynchronous execution in torch-ccl. It requires XeLink to be enabled for cross-card communication. The default setting 0 leaves this feature disabled. |
+| CCL_BLOCKING_WAIT | 0 | This is a prototype feature that controls whether collective execution on XPU is host blocking or non-blocking. The default setting 0 enables blocking behavior. |
+| CCL_SAME_STREAM | 0 | This is a prototype feature that allows using a computation stream as the communication stream to minimize stream synchronization overhead. The default setting 0 uses separate streams for communication. |
+
+For the launch options above that can be configured to 1 or 0, users can also configure them to ON or OFF, where ON equals 1 and OFF equals 0.
+
+Examples to configure the launch options:
+ +- Set one or more options before running the model + +```bash +export IPEX_LOG_LEVEL=1 +export IPEX_FP32_MATH_MODE=TF32 +... +python ResNet50.py +``` +- Set one option when running the model + +```bash +IPEX_LOG_LEVEL=1 python ResNet50.py +``` + +- Set more than one options when running the model + +```bash +IPEX_LOG_LEVEL=1 IPEX_FP32_MATH_MODE=TF32 python ResNet50.py +``` diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/amp_gpu.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/amp_gpu.md.txt new file mode 100644 index 000000000..b82163d43 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/amp_gpu.md.txt @@ -0,0 +1,107 @@ +Auto Mixed Precision (AMP) on GPU +================================= + +## Introduction + +`torch.xpu.amp` provides convenience for auto data type conversion at runtime. Deep learning workloads can benefit from lower-precision floating point data types such as `torch.float16` or `torch.bfloat16`, because of its lighter calculation workload and smaller memory usage. Accuracy is sacrificed when using lower-precision floating point data types so there's a trade-off between accuracy and performance. Thus, some operations should use the slower but more accurate `torch.float32`, while others can be converted to use the faster but less accurate `torch.float16` or `torch.bfloat16` data type. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators. + +Inference workloads using `torch.xpu.amp` support `torch.bfloat16` and `torch.float16`. Training workloads using `torch.xpu.amp` support `torch.bfloat16`. `torch.bfloat16` is the default lower precision floating point data type when `torch.xpu.amp` is enabled. + +## Use Case + +The following simple network should show a speedup with mixed precision. + +```python +class SimpleNet(torch.nn.Module): + def __init__(self): + super(SimpleNet, self).__init__() + self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False) + + def forward(self, x): + return self.conv(x) +``` + +### Default Precision + +Without `torch.xpu.amp`, the network executes all operators with default precision (`torch.float32`). + +```python +model = SimpleNet().to("xpu") +x = torch.rand(64, 64, 224, 224).to("xpu") +y = model(x) +``` + +### Inference with Imperative Path + +`torch.xpu.amp.autocast` is designed to be a context manager that allow scopes of your script to run with mixed precision. In these scopes, operations run in a data type chosen by the `autocast` class to improve performance while maintaining accuracy. See the operations category section for details on what precision the `autocast` class chooses for each operator, and under what circumstances. + +```python +model = SimpleNet().to("xpu").eval() +x = torch.rand(64, 64, 224, 224).to("xpu") +with torch.xpu.amp.autocast(dtype=torch.float16): + y = model(x) +``` + +### Inference with TorchScript Path + +`torch.xpu.amp.autocast` can be used with `torch.jit.trace` to apply graph optimization. Due to PyTorch limitation, only `torch.jit.trace` is supported. + +```python +model = SimpleNet().to("xpu").eval() +x = torch.rand(64, 64, 224, 224).to("xpu") +with torch.xpu.amp.autocast(dtype=torch.float16): + model = torch.jit.trace(model, x) + model = torch.jit.freeze(model) + y = model(x) +``` + +### Training Support + +`torch.xpu.amp.autocast` can be used in training to improve performance. 
+ +```python +model = SimpleNet().to("xpu") +optimizer = torch.optim.SGD(model.parameters(), lr=0.001) +for images, label in train_loader(): + with torch.xpu.amp.autocast(): + loss = criterion(model(images.to("xpu")), label.to("xpu")) + loss.backward() + optimizer.step() +``` + +## Autocast Op Reference + +### Op Eligibility + +Ops that run in `float64` or non-floating-point dtypes are not eligible for mixed precision, and will run in these types whether or not autocast is enabled. + +Only out-of-place ops and Tensor methods are eligible for mixed precision. In-place variants and calls that explicitly supply an `out=...` Tensor +are allowed in autocast-enabled regions, but won't go through autocasting. For example, in an autocast-enabled region `a.addmm(b, c)` can autocast, but `a.addmm_(b, c)` and `a.addmm(b, c, out=d)` cannot. For best performance and stability, use out-of-place ops in autocast-enabled regions. + +### Op-Specific Behavior + +The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a `torch.nn.Module`, as a function, or as a `torch.Tensor` method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace. + +Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they're downstream from autocasted ops. + +If an op is unlisted, we assume it's numerically stable in `bfloat16` or `float16`. If you believe that an unlisted op is numerically unstable in `bfloat16` or `float16`, file a [GitHub issue](https://github.com/intel/intel-extension-for-pytorch/issues). + +#### Ops that can autocast to `bfloat16` + +`conv1d`, `conv2d`, `conv3d`, `_convolution`, `convolution`, `conv_tbc`, `conv_transpose1d`, `conv_transpose1d`, `conv_transpose3d`, `prelu`, `addmm`, `addmv`, `addr`, `linear`, `matmul`, `mm`, `mv`, `bmm`, `baddbmm`, `addbmm`, `chain_matmul`, `linalg_multi_dot`, `_thnn_fused_gru_cell`, `gru_cell`, `scaled_dot_product_attention` + +#### Ops that can autocast to `float16` + +`conv1d`, `conv2d`, `conv3d`, `_convolution`, `convolution`, `conv_tbc`, `conv_transpose1d`, `conv_transpose1d`, `conv_transpose3d`, `prelu`, `addmm`, `addmv`, `addr`, `linear`, `matmul`, `mm`, `mv`, `bmm`, `baddbmm`, `addbmm`, `chain_matmul`, `linalg_multi_dot`, `_thnn_fused_gru_cell`, `gru_cell`, `scaled_dot_product_attention` + +#### Ops that can autocast to `float32` + +`binary_cross_entropy`, `binary_cross_entropy_with_logits`, `log_softmax`, `nll_loss`, `nll_loss2d`, `nll_loss_nd`, `cross_entropy_loss`, `fft_fft`, `fft_ifft`, `fft_fft2`, `fft_ifft2`, `fft_fftn`, `fft_ifftn`, `fft_rfft`, `fft_irfft`, `fft_rfft2`, `fft_irfft2`, `fft_rfftn`, `fft_irfftn`, `fft_hfft`, `fft_ihfft`, `reciprocal`, `pow`, `frobenius_norm`, `nuclear_norm`, `cosine_similarity`, `poisson_nll_loss`, `cosine_embedding_loss`, `hinge_embedding_loss`, `kl_div`, `l1_loss`, `smooth_l1_loss `, `huber_loss`, `mse_loss`, `margin_ranking_loss`, `multilabel_margin_loss`, `soft_margin_loss`, `triplet_margin_loss`, `multi_margin_loss`, `dist`, `pdist`, `cdist`, `renorm` + +#### Ops that promote to the widest input type + +These ops don't require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are `bfloat16`, the op runs in `bfloat16`. 
If any of the inputs is `float32`, autocast casts all inputs to `float32` and runs the op in `float32`. + +`cat`, `stack`, `addcdiv`, `addcmul`, `atan2`, `bilinear`, `cross`, `dot`, `grid_sampler`, `index_put`, `tensordot`, `scatter_add` + +Some ops not listed here (e.g., binary ops such as `add`) natively promote inputs without autocasting's intervention. If inputs are a mixture of `bfloat16` and `float32`, these ops run in `float32` and produce `float32` output, regardless of whether autocast is enabled. diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/auto_channels_last.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/auto_channels_last.md.txt new file mode 100644 index 000000000..ffa1b6194 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/auto_channels_last.md.txt @@ -0,0 +1,32 @@ +Auto Channels Last +================== + +Channels last memory format is known to have performance advantage over channels first memory format. Refer to [Channels Last](./nhwc.md) for details. +Intel® Extension for PyTorch\* automatically converts the model to channels last memory format by default when users optimize their model with `ipex.optimize(model)`. + +## Ease-of-use auto channels last API + +**Note:** Auto channels last APIs `ipex.enable_auto_channels_last()` and `ipex.disable_auto_channels_last()` will be deprecated in future releases. + +#### default +```python +model = ipex.optimize(model) # by default, model is channels last +``` + +#### enable +```python +ipex.enable_auto_channels_last() # This API will be deprecated in future releases. +model = ipex.optimize(model) # enable, model is channels last +``` + +#### disable +```python +ipex.disable_auto_channels_last() # This API will be deprecated in future releases. +model = ipex.optimize(model) # disable, model is channels first +``` + +## Known issue +For broad models, channels last memory format brings performance boost over channels first memory format. However, for few use cases, this may bring performance regression. If performance regression is observed, we recommend to feed sample input data to `ipex.optimize(model, sample_input=...)`. +```python +model = ipex.optimize(model, sample_input=...) +``` diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/compute_engine.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/compute_engine.md.txt new file mode 100644 index 000000000..fe1392c34 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/compute_engine.md.txt @@ -0,0 +1,60 @@ +Compute Engine (Experimental feature for debug) +=============================================== + +## Introduction + +Compute engine provides the capacity to choose specific backend for operators with multiple implementations. For example, with compute engine set, we can prefer to using SYCL than oneDNN implementation for concatenation. The feature can help user to customize model forward behavior for better performance or special requirement. + +We currently support 5 compute engines, namely, `RECOMMEND`, `BASIC`, `ONEDNN`, `ONEMKL`, `XETLA`. Each op with multiple implementations has a recommend one based on our empirical experience. The `RECOMMEND` engine would guarantee performance on most shape input ideally. `BASIC` engines refers to SYCL implementation. 
`ONEDNN`, `ONEMKL`, `XETLA` refers to optimized implementation provided by library [Intel® oneAPI Deep Neural Network Library (oneDNN)](https://github.com/oneapi-src/oneDNN), [Intel® oneAPI Math Kernel Library (oneMKL)](https://github.com/oneapi-src/oneMKL) and [Intel® Xe Templates for Linear Algebra](https://github.com/intel/xetla). + +## Use Case + +Code snippet below demonstrates the usage of compute engine feature to select oneDNN as the compute engine of operator `torch.cat`. + +```python +with torch.xpu.compute_eng(torch.xpu.XPUComputeEng.ONEDNN): + x1 = torch.randn((1, 3, 20, 20), device="xpu") + x2 = torch.randn((1, 5, 20, 20), device="xpu") + torch.cat([x1, x2], dim=1) +``` + +## Engine Selection Policy +Generally, priority of choosing engine follows the order `operator special argument > onednn_layout format input > user set engine > recommend engine`. Check the following for details: + +Step 1: In some cases, operators with specific arguments may not have implementations for all compute engines. For these operators, the implemented compute engines have the highest priority in the selection process. For example, operator `torch.nn.Upsample` with argument `align_corners=True` has only SYCL implementation for GPU. Thus, the BASIC engine, referring to SYCL implementations, is always its computing engine. + +Step2: If no special argument, and inputs contain `ONEDNN_LAYOUT` Tensor, `ONEDNN` engine would be chosen if possible. This would utilize the highly optimized code in library oneDNN to speedup computation. If `oneDNN` has no support for the operator, engine selection process continues to next step. + +Step3: If user manually set a engine, this engine is chosen once the operator supports this implementation. + +Step4: If the compute engine designated by user is not implemented/available, execution of the operator will fall back on to the `RECOMMEND` engine. + +![fig-2(1)-pt-conv-layout-path-dispatch](../../images/compute_eng/compute_eng_arc.png) + + +## Multiple Implementations Operators and Engines +`AveragePool2d`: `ONEDNN`, `BASIC` [Recommend] + +`Concat`: `ONEDNN`, `BASIC` [Recommend] + +`MaxPool2d`, `MaxPool3d`: `ONEDNN`, `BASIC` [Recommend] + +`LSTM`: `ONEDNN`, `BASIC` [Recommend] + + Basic is recommended currently. When optimizations in oneDNN finish, `ONEDNN` would be the recommend engine. + +`LayerNorm`: `ONEDNN`, `BASIC` [Recommend] + +`PermuteContiguous`: `ONEDNN`, `BASIC` [Recommend] + +`SoftMax`: `ONEDNN`, `BASIC` [Recommend] + + The `BASIC` engine is always chosen if input tensor has `dimension` greater than 3 or its `dtype` is other than `fp16, fp32` or `bfloat16`. + +`UpsampleBlinear2d`: `ONEDNN`, `BASIC` [Recommend] + + The `BASIC` engine is always chosen if argument `align_corners=True`. + +`UpsampleNearest`: `ONEDNN`, `BASIC` [Recommend] + + The `ONEDNN` engine is always chosen if output shape is divisible by the input shape. diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/deepspeed_kernels.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/deepspeed_kernels.md.txt new file mode 100644 index 000000000..a2da12fad --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/deepspeed_kernels.md.txt @@ -0,0 +1,15 @@ +Intel® Extension for PyTorch\* - DeepSpeed\* Kernels +===================================================== +(intel_extension_for_pytorch.deepspeed module) + +## Introduction +[DeepSpeed](https://github.com/microsoft/DeepSpeed)\* creates custom kernels for its feature support and performance optimizations. 
The DeepSpeed custom kernels for Intel XPU device are integrated into Intel® Extension for PyTorch\* under the ecological library category. It worths noting that the kernels are designed specifically for DeepSpeed\* therefore it is NOT necessarily common or validated when being used in scenarios other than DeepSpeed\*. + +The DeepSpeed\* kernels module provides below custom kernels for DeepSpeed\*: +- quantization: including quantize/dequantize with fp32/fp16, etc +- transformer inference: including the bias GeGLU, layernorm, layernorm + residual, layernorm + store pre layernorm residual, RMS norm, pre RMS norm, vector add, MLP with fp16, MoE residual matmul, reset cache, release/retake workspace etc. + +## Supported Platform +This module supports xpu device on Intel® Data Center GPU Max Series only. + + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/float8.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/float8.md.txt new file mode 100644 index 000000000..37607da87 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/float8.md.txt @@ -0,0 +1,43 @@ +Float8 Data Type Support (Prototype) +==================================== + +## Float8 Data Type + +Float8 (FP8) is a 8-bit floating point data type, which is used to reduce memory footprint, improve the computation efficiency and save power in Deep Learning domain. + +Two formats are used in FP8 training and inference, in order to meet the required value range and precision of activation, weight and gradient in Deep Neural Network (DNN). One is E4M3 (sign-exponent-mantissa) for activation and weight, the other is E5M2 for gradients. These two formats are defined in [FP8 FORMATS FOR DEEP LEARNING](https://arxiv.org/pdf/2209.05433.pdf). + +## FP8 Quantization + +On GPU, online Dynamic Quantization is used for FP8 data compression and decompression. Delayed Scaling algorithm is used for accelerating the quantizaiton process. + +## Supported running mode + +Both DNN Training and Inference are supported with the FP8 data type. + +## Supported operators + +FP8 Linear operator is supported. + +## FP8 usage example + +BERT model is supported as a FP8 training showcase, see the following example: + +```python +from intel_extension_for_pytorch.quantization.fp8 import ( + fp8_autocast, + DelayedScaling, + Format, + FP8Linear, +) + +## Convert the original model to a new model composed of FP8 operators. +fp8_model = prepare_fp8(model) +## Run FP8 model. +with fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()): + outputs = fp8_model(input_ids=input_ids, + token_type_ids=segment_ids, + attention_mask=input_mask, + labels=masked_lm_labels, + next_sentence_label=next_sentence_labels) +``` diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/horovod.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/horovod.md.txt new file mode 100644 index 000000000..cae897fc8 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/horovod.md.txt @@ -0,0 +1,109 @@ +Horovod with PyTorch (Prototype) +================================ + +Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, you need to install Horovod with Pytorch first, and make specific change for Horovod in your training script. 
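+
+As a quick, hypothetical illustration of the MPI-style primitives named above (this snippet is not part of the original Horovod material), the following prints the world size, global rank, and local rank of each worker and averages a tensor across all workers with `allreduce`:
+
+```python
+import torch
+import intel_extension_for_pytorch
+import horovod.torch as hvd
+
+hvd.init()
+
+# world size, global rank of this worker, and rank within the local node
+print(hvd.size(), hvd.rank(), hvd.local_rank())
+
+# allreduce averages the tensor across all workers by default
+t = torch.ones(2) * hvd.rank()
+print(hvd.allreduce(t))
+```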
+ +## Install Horovod with PyTorch + +You can use normal pip command to install [Intel® Optimization for Horovod\*](https://pypi.org/project/intel-optimization-for-horovod/): + +```bash +python -m pip install intel-optimization-for-horovod +``` + +**Note:** Make sure you already install oneAPI basekit. You need to activate the environment when use Horovod. + +```bash +source ${HOME}/intel/oneapi/ccl/latest/env/vars.sh +``` + +## Horovod with PyTorch Usage + +To use Horovod with PyTorch for XPU backend, make the following modifications to your training script: + +1. Initialize Horovod. + + + import torch + import intel_extension_for_pytorch + import horovod.torch as hvd + hvd.init() + +2. Pin each GPU to a single process. + + With the typical setup of one GPU per process, set this to *local rank*. The first process on + the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth. + + + devid = hvd.local_rank() + torch.xpu.set_device(devid) + +3. Scale the learning rate by the number of workers. + + Effective batch size in synchronous distributed training is scaled by the number of workers. + An increase in learning rate compensates for the increased batch size. + +4. Wrap the optimizer in ``hvd.DistributedOptimizer``. + + The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using *allreduce* or *allgather*, and then applies those averaged gradients. + +5. Broadcast the initial variable states from rank 0 to all other processes: + + + hvd.broadcast_parameters(model.state_dict(), root_rank=0) + hvd.broadcast_optimizer_state(optimizer, root_rank=0) + + This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. + +6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. + + Accomplish this by guarding model checkpointing code with ``hvd.rank() != 0``. + + +Example: + + + import torch + import intel_extension_for_pytorch + import horovod.torch as hvd + + # Initialize Horovod + hvd.init() + + # Pin GPU to be used to process local rank (one GPU per process) + devid = hvd.local_rank() + torch.xpu.set_device(devid) + device = "xpu:{}".format(devid) + + # Define dataset... + train_dataset = ... + + # Partition dataset among workers using DistributedSampler + train_sampler = torch.utils.data.distributed.DistributedSampler( + train_dataset, num_replicas=hvd.size(), rank=hvd.rank()) + + train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler) + + # Build model... + model = ... + model.to(device) + + optimizer = optim.SGD(model.parameters()) + + # Add Horovod Distributed Optimizer + optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters()) + + # Broadcast parameters from rank 0 to all other processes. 
+ hvd.broadcast_parameters(model.state_dict(), root_rank=0) + + for epoch in range(100): + for batch_idx, (data, target) in enumerate(train_loader): + optimizer.zero_grad() + output = model(data) + loss = F.nll_loss(output, target) + loss.backward() + optimizer.step() + if batch_idx % args.log_interval == 0: + print('Train Epoch: {} [{}/{}]\tLoss: {}'.format( + epoch, batch_idx * len(data), len(train_sampler), loss.item())) + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/int8_overview_xpu.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/int8_overview_xpu.md.txt new file mode 100644 index 000000000..753139f5f --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/int8_overview_xpu.md.txt @@ -0,0 +1,101 @@ +Intel® Extension for PyTorch\* Optimizations for Quantization [GPU] +=================================================================== + +Intel® Extension for PyTorch\* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs. + +The overall view is that our usage follows the API defined in official PyTorch. Therefore, only small modification like moving model and data to GPU with `to('xpu')` is required. We highly recommend using the TorchScript for quantizing models. With graph model created via TorchScript, optimization like operator fusion (e.g. `conv_relu`) is enabled automatically. This delivers the best performance for int8 workloads. + +## Imperative Mode +```python +import torch +import intel_extension_for_pytorch + +# Define model +model = Model().to("xpu") +model.eval() +modelImpe = torch.quantization.QuantWrapper(model) + +# Define QConfig +qconfig = torch.quantization.QConfig(activation=torch.quantization.observer.MinMaxObserver .with_args(qscheme=torch.per_tensor_symmetric), + weight=torch.quantization.default_weight_observer) # weight could also be perchannel + +modelImpe.qconfig = qconfig + +# Prepare model for inserting observer +torch.quantization.prepare(modelImpe, inplace=True) + +# Calibration to obtain statistics for Observer +for data in calib_dataset: + modelImpe(data) + +# Convert model to create a quantized module +torch.quantization.convert(modelImpe, inplace=True) + +# Inference +modelImpe(inference_data) +``` + +Imperative mode usage follows official Pytorch and more details can be found at [PyTorch doc](https://pytorch.org/docs/1.9.1/quantization.html). + +Defining the quantized config (QConfig) for model is the first stage of quantization. Per-tensor quantization is supported for activation quantization, while both per-tensor and per-channel are supported for weight quantization. Weight can be quantized to `int8` data type only. As for activation quantization, both symmetric and asymmetric are supported. Also, both `uint8` and `int8` data types are supported. + +If the best performance is desired, we recommend using the `symmetric+int8` combination. Other configuration may have lower performance due to the existence of `zero_point`. + +After defining a QConfig, the `prepare` function is used to insert observer in models. The observer is responsible for collecting statistics for quantization. A calibration stage is needed for observer to collect info. + +After calibration, function `convert` would quantize weight in module and swap FP32 module to quantized ones. Then, an int8 module is created. Be free to use it for inference. 
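+
+Below is a minimal sketch of the recommended `symmetric+int8` combination described above, using stock PyTorch observers. The exact observer choices here are an assumption rather than part of the original text; tune them to your own accuracy needs:
+
+```python
+import torch
+
+# per-tensor symmetric int8 activations, per-channel int8 weights
+qconfig_sym_int8 = torch.quantization.QConfig(
+    activation=torch.quantization.observer.MinMaxObserver.with_args(
+        qscheme=torch.per_tensor_symmetric, dtype=torch.qint8),
+    weight=torch.quantization.default_per_channel_weight_observer,
+)
+```
+
+Such a QConfig can be assigned to the wrapped model in the imperative flow above, or passed to `prepare_jit` in the TorchScript flow below.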
+ +## TorchScript Mode +```python +import torch +import intel_extension_for_pytorch +from torch.quantization.quantize_jit import ( + convert_jit, + prepare_jit, +) + +# Define model +model = Model().to("xpu") +model.eval() + +# Generate a ScriptModule +modelJit = torch.jit.trace(model, example_input) # or torch.jit.script(model) + +# Defin QConfig +qconfig = torch.quantization.QConfig( + activation=torch.quantization.observer.MinMaxObserver.with_args( + qscheme=qscheme, + reduce_range=False, + dtype=dtype + ), + weight=torch.quantization.default_weight_observer +) + +# Prepare model for inserting observer +modelJit = prepare_jit(modelJit, {'': qconfig}, inplace=True) + +# Calibration +for data in calib_dataset: + modelJit(data) + +# Convert model to quantized one +modelJit = convert_jit(modelJit) + +# Warmup to fully trigger fusion patterns +for i in range(5): + modelJit(warmup_data) +# Inference +modelJit(inference_data) + +# Debug +print(modelJit.graph_for(inference_dta)) +``` + +We need to define `QConfig`` for TorchScript module, use `prepare_jit` for inserting observer and use `convert_jit` for replacing FP32 modules. + +Before `prepare_jit`, create a ScriptModule using `torch.jit.script` or `torch.jit.trace`. `jit.trace` is recommended for capable of catching the whole graph in most scenarios. + +Fusion operations like `conv_unary`, `conv_binary`, `linear_unary` (e.g. `conv_relu`, `conv_sum_relu`) are automatically enabled after model conversion (`convert_jit`). A warmup stage is required for bringing the fusion into effect. With the benefit from fusion, ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule as for performance consideration. + +`modelJit.graph_for(input)` is useful to dump the inference graph and other graph related information for performance analysis. + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/ipex_log.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/ipex_log.md.txt new file mode 100644 index 000000000..1e468eb35 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/ipex_log.md.txt @@ -0,0 +1,84 @@ +`IPEX_LOG` (Prototype) +========================== + +## Introduction + +`IPEX_LOG` provides the capability to log verbose information from Intel® Extension for PyTorch\* . Please use `IPEX_LOG` to get the log information or trace the execution from Intel® Extension for PyTorch\*. Please continue using PyTorch\* macros such as `TORCH_CHECK`, `TORCH_ERROR`, etc. to get the log information from PyTorch\*. + +## `IPEX_LOG` Definition +### Log Level +The supported log levels are defined as follows, default log level is `DISABLED`: + +| log level | number | usage | +| :----: | :----: | :----: | +| DISABLED | -1 | Disable the logging | +| TRACE | 0 | Reserve for further usage | +| DEBUG | 1 | Provide the whole calling stack info | +| INFO | 2 | Record calling info to other library functions and environment variable settings | +| WARN | 3 | Warn the second attempt of an action, such as memory reallocation | +| ERR | 4 | Report error in try catch | +| CRITICAL | 5 | Reserve for further usage | + +### Log Component +Log component is used to specify which part from Intel® Extension for PyTorch\* does this log information belong to. 
The supported log components are defined as follows: + +| log component | description | +| :----: | :----: +| OPS | Launch SYCL, oneDNN, oneMKL operators | +| SYNGRAPH | Syngraph related | +| MEMORY | Allocate/Free memory, Allocate/Free cache | +| RUNTIME | Device / Queue related | +| ALL | All output log | + +## Usage in C++ +All the usage are defined in `utils/LogUtils.h`. Currently Intel® Extension for PyTorch\* supports: + +### Simple Log +You can use `IPEX_XXX_LOG`, XXX represents the log level as mentioned above. There are four parameters defined for simple log: +- Log component, representing which part of Intel® Extension for PyTorch\* does this log belong to. +- Log sub component, input an empty string("") for general usages. For `SYNGRAPH` you can add any log sub componment. +- Log message template format string, same as fmt_string in lib fmt, `{}` is used as a place holder for format args . +- Log args for template format string, args numbers should be aligned with size of `{}`s. + +Below is an example for using simple log inside abs kernel: + +``` c++ + +IPEX_INFO_LOG("OPS", "", "Add a log for inside ops {}", "abs"); + +``` +### Event Log +Event log is used for recording a whole event, such as an operator calculation. The whole event is identified by an unique `event_id`. You can also mark each step by using `step_id`. Use `IPEX_XXX_EVENT_END()` to complete the logging of the whole event. `XXX` represents the log level mentioned above. It will be used as the log level for all logs within one single log event. + +Below is an example for using event log: + +```c++ +IPEX_EVENT_LOG("OPS", "", "record_avg_pool", "start", "Here record the time start with arg:{}", arg); +prepare_data(); +IPEX_EVENT_LOG("OPS", "", "record_avg_pool", "data_prepare_finish", "Here record the data_prepare_finish with arg:{}", arg); +avg_pool(); +IPEX_INFO_EVENT_END("OPS", "", "record_avg_pool", "finish conv", "Here record the end"); +``` + +## Enviornment settings +Intel® Extension for PyTorch\* provides five enviornment variables for configuring log output: + +- `IPEX_LOG_LEVEL`, accept integar or string, default is -1 for `DISABLED`. +- `IPEX_LOG_COMPONENT`, accept string, used for specifying the log component and sub log component you would like to log, default is "ALL". The log component and sub log component are separated by `/`. You could also specify several log components, such as "OPS;MEMORY". +- `IPEX_LOG_OUTPUT`, accept string. If you are using `IPEX_LOG_OUTPUT`, than all the logs will recorded inside a file rather than the console. Example: export IPEX_LOG_OUTPUT="./ipex.log". +- `IPEX_LOG_ROTATE_SIZE`, accept integar, default is 10. Can be used only with `IPEX_LOG_OUTPUT`, for specifing how large file will be used when rotating this log, size is MB. +- `IPEX_LOG_SPLIT_SIZE`, accept integar, default = null. Can be used only with `IPEX_LOG_OUTPUT`, for specifing how large file will be used when splitting the logs, size is MB. + +## Usage in python +- `torch.xpu.set_log_level(log_level)` and `torch.xpu.get_log_level()`, these two functions are used for getting and setting the log level. +- `torch.xpu.set_log_output_file_path(log_path)` and `torch.xpu.get_log_output_file_path()`, these two functions are used for getting and setting the log output file path, once log output file path is set, logs will be recorded in file only. +- `torch.xpu.set_log_rotate_file_size(file size)` and `torch.xpu.get_log_rotate_file_size()`, these two functions are used for getting and setting the log rotate file size. 
Can be used when output file path is set. +- `torch.xpu.set_log_split_file_size(file size)` and `torch.xpu.get_log_split_file_size()`, these two functions are used for getting and setting the log split file size. Can be used when output file path is set. +- `torch.xpu.set_log_component(log_component)`, and `torch.xpu.get_log_component()`, these two functions are used for getting and setting the log component. The log component string are the same as defined in enviornment settings. + +## Replace `IPEX_SIMPLE_TRACE` +Use `torch.xpu.set_log_level(0)` to get logs to replace the previous usage in `IPEX_SIMPLE_TRACE`. + +## Replace `IPEX_VERBOSE` +Use `torch.xpu.set_log_level(1)` to get logs to replace the previous usage in `IPEX_VERBOSE`. + diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/nhwc.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/nhwc.md.txt new file mode 100644 index 000000000..971a8b665 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/nhwc.md.txt @@ -0,0 +1,254 @@ +Channels Last +============= + +## What is Channels Last + +**Note**: In PyTorch, **memory format** refers to data representation that describes how multidimensional arrays (nD) are stored in linear (1D) memory address space. **Memory format** has the same semantic meaning as **layout** in oneDNN. **Layout** in PyTorch has other semantic of describing **dense** or **sparse** with the attributes: 'torch.strided', 'torch.sparse_coo'. + +On CNN models, the canonical order of tensor dimensions is assigned with semantic meaning. For example the input tensor of 2D convolution is of NCHW by default on PyTorch - . NHWC is an alternative way of describing the tensor dimensions - . + +Look at the following image of illustrating NCHW and NHWC when N=1. Actually when N=1, NHWC has the same format with BMP file image. +![fig-1-memory-layout](../../images/channels_last/figure1_memory_layout.png) + +PyTorch refers to NCHW as `torch.contiguous_format` (the default memory format) and to NHWC as `torch.channels_last`, which is a new feature as of the 1.5 release. + +TensorFlow uses NHWC as the default memory format because NHWC has a performance advantage over NCHW. On Intel® platforms, we propose to optimize Channels Last memory path for the following reasons: +* **Performance** - NHWC performance is not as good as blocked memory format (nChw16c), but it is close, and much better performance than NCHW. +* **User Experience** - Operator coverage of NHWC would be higher than blocked memory format, so user experience is better. To be specific, it is difficult to enable operators that manipulates `dim` on blocked format such as `sum(dim=?)`. You would need to convert tensor from blocked memory format back to NHWC using `to_dense()`, before feeding it into `sum()`. This is naturally supported on Channels Last memory format already. +* **Upstream** - Will be easier since CPU doesn't hold secret ingredient and both inference and training will be covered. + +## Memory Format Is All That Matters + +On CNN models, memory format is almost the foundation of any upper level design. One important fact is that converting memory format could be very expensive. Thus, in case that multiple CNN operators are performed in sequence, e.g. `Conv2d -> ReLU -> Conv2d`, it's beneficial to transform them from different memory formats once, do computation and reorder them back. + +On PyTorch, you can use 3 types of memory formats on CNN models: + +### a. 
NCHW (default) + +```python +device='cpu' # or 'xpu' +if device == 'xpu': + import intel_extension_for_pytorch + +## NB: internally blocked format will still be used. +## aka. we do 'reorder' for 'input', 'weight' and 'output', +## and believe me this is expensive, roughly 50% perf loss... +input = torch.randn(1, 10, 32, 32).to(device) +model = torch.nn.Conv2d(10, 20, 1, 1).to(device) +output = model(input) +``` + +### b. NHWC + +```python +device='cpu' # or 'xpu' +if device == 'xpu': + import intel_extension_for_pytorch + +input = torch.randn(1, 10, 32, 32).to(device) +model = torch.nn.Conv2d(10, 20, 1, 1).to(device) +## NB: convert to Channels Last memory format. +## oneDNN supports NHWC for feature maps (input, output), +## but weight still needs to be of blocked format. +## Still we can save reorders for feature maps. +input = input.to(memory_format=torch.channels_last) +model = model.to(memory_format=torch.channels_last) +output = model(input) +``` + +### c. Blocked (nChw16c, on CPU) + +```python +from torch.utils import mkldnn as mkldnn_utils +input = torch.randn(1, 10, 32, 32) +model = torch.nn.Conv2d(10, 20, 1, 1) +## NB: convert to blocked memory format. +## Note that 'output' is in blocked memory format, +## in case the subsequent operator doesn't support blocked memory format +## you need to manually reorder it back to NCHW by output.to_dense() +## mkldnn_utils.to_mkldnn(model) is used to prepack the weight, this will save weight reorder time +## for inference. For training, it is not needed. +input = input.to_mkldnn() +model = mkldnn_utils.to_mkldnn(model) +output = model(input) +``` + +Better to explain the concepts here with a diagram, the **dotted lines** indicate simple memory view, no hard copy. +![fig-2(1)-pt-conv-layout-path-dispatch](../../images/channels_last/figure2_dispatch.png) + +**Conclusion** is that NHWC path saves the reorders from feature maps compared with NCHW path, but still weight reorder is necessary since oneDNN requires weights to be in blocked memory format. From performance perspective, when `batch_size=N`, weight reorder is minimum compared to feature map reorder. But when `batch_size=1`, weight reorder is usually not negligible. So whether to enable weight prepacking on channels last memory format needs further discussion. + +## PyTorch Strided Layout + +Before moving on, let's explain how PyTorch organizes tensors in memory - the **layout**. Here we only focus on **dense** tensors, skipping 'coo' layout of **sparse** tensor. + +The question itself can be reinterpreted as, for a tensor of size , how does PyTorch access the element with index from memory? The answer is **stride**: + +```python +tensor: +index: +strides: +offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w + = CHW * n + HW * c + W * h + 1 * w +``` + +One merit of introducing **stride** is that it can express noncontiguous tensors, e.g. a slice of big tensor. For example, the 'Xs' in the following image have a stride of . + +![fig-3-pytorch-strided-layout](../../images/channels_last/figure3_strided_layout.png) + +Keep in mind that PyTorch Tensor does not have an attribute called 'memory_format' or something else. The memory format expression completely relies on **size** and **stride**. The design principle can be found at reference: [RFC: Memory format (aka layout aka NHWC) support](https://github.com/pytorch/pytorch/issues/19092). No matter what the tensor's memory format is, we need a logical canonical order for the dimensions - that is **NCHW** on PyTorch. 
Thus, **size** and **stride** are ALWAYS described in the order of **NCHW**. Let's now look at the Channels Last case of the previous question: +```python +tensor: +index: +strides: +offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w + = HWC * n + 1 * c + WC * h + C * w +``` + +Actually, this pattern applies to ALL other memory formats as long as it is 4-dim, e.g. strides for CHWN would be <1, HWN, WN, N>. + +## Channels Last Memory Format APIs + +### a. tensor creation +```python +device='cpu' # or 'xpu' +if device == 'xpu': + import intel_extension_for_pytorch + +x = torch.empty(N, C, H, W, memory_format=torch.channels_last).to(device) +``` + +### b. tensor conversion +```python +device='cpu' # or 'xpu' +if device == 'xpu': + import intel_extension_for_pytorch + +## .contiguous() transforms NHWC noncontiguous to NHWC contiguous. +## .to() converts NCHW tensor to NHWC one, it is outplace. +x = x.to(device) +x = x.contiguous(memory_format=torch.channels_last) +x = x.to(memory_format=torch.channels_last) + +## contiguous check +x.is_contiguous(memory_format=torch.channels_last) +``` + +### c. model conversion +```python +device='cpu' # or 'xpu' +if device == 'xpu': + import intel_extension_for_pytorch + +## NB: tensor.to() is an outplace operation +## model.to() is inplace. It calls _apply() which is inplace. +model = model.to(device).to(memory_format=torch.channels_last) +input = input.to(device).to(memory_format=torch.channels_last) +``` + +### d. operator coverage in PyTorch + +Detailed operator coverage information has been listed at reference [Operators-with-Channels-Last-support](https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support). + +Some spontaneous questions: +* **How to tell whether this model or operator support Channels Last?** - This requires manual memory format check, aka. 'torch.channels_last' input and weight shall NOT generate 'torch.contiguous_format' output. +* **What if the model comprises of operator not supported Channels Last?** - No errors messages will be shown, the NHWC tensor will be handled by the operator as a non-contiguous NCHW tensor, so result might not be correct depending on the algorithm of this operator. + +## Writing Channels Last Kernels on CPU + +### a. Register Channels Last Kernel in ATen Native Manner + +The general guideline has been listed under reference [Writing-memory-format-aware-operators](https://github.com/pytorch/pytorch/wiki/Writing-memory-format-aware-operators), not to repeat here. You may take one of my recent PR [optimize upsample performance linear mode on CPU](https://github.com/pytorch/pytorch/pull/34864) as an example, which also demonstrates NHWC performance advantage over NCHW because of the ease of vectorization. + +### b. Register oneDNN Kernel on Channels Last + +Registering a oneDNN kernel under Channels Last memory format on CPU is no different from [cuDNN](https://github.com/pytorch/pytorch/pull/23861): Only very few upper level changes are needed, such as accommodate 'contiguous()' to 'contiguous(suggested_memory_format)'. The automatic reorder of oneDNN weight shall have been hidden in ideep. + +## oneDNN NHWC APIs + +Compared to NCHW interfaces, 2 parts need to be addressed on NHWC interfaces: + +### a. Create NHWC Memory + +The logical size and stride description of oneDNN is always in NCHW, this is identical to PyTorch. 
Example code such as +```cpp +/* create md from memory::format_tag */ +auto src_md = memory::desc( + {N, C, H, W}, // logical dims, the order is defined by a primitive + memory::data_type::f32, // tensor's data type + memory::format_tag::nhwc // memory format, NHWC in this case +); + +/* alternative: create md from strides */ +auto src_md = memory::desc( + {N, C, H, W}, // logical dims, the order is defined by a primitive + memory::data_type::f32, // tensor's data type + {stride_N, stride_C, stride_H, stride_W} // the strides +); + +/* create memory */ +auto src_mem = memory(src_md, src_data_ptr, engine); +``` + +### b. Create Convolution Primitive + +* **NCHW** - create `memory::desc` with *any* card for 'input', 'output' and 'weight'; query proposed `memory::desc` from convolution primitive; +* **NHWC** - create `memory::desc` with `format_tag::nhwc` for 'input' and 'output', use *any* for 'weight'; if we use `hwio` for 'weight' convolution primitive will be created with gemm rather jit avx512. + +## Channels Last 1D support on XPU + +Both stock PyTorch and Intel® Extension for PyTorch\* support Channels Last(2D) and Channels Last 3D, however, regarding Channels Last 1D, they are different. Stock PyTorch doesn't support Channels Last 1D, while XPU could supply limited support for Channels Last 1D. +We only support Channels Last 1D memory format in these operators: Conv1D, BatchNorm1D, MaxPool1D, Concat, binary add, binary div, upsample linear and upsample nearest. + +The usage of Channels Last 1D on XPU is different from stock PyTorch Channels Last(2D) or Channels Last 3D. We use torch.xpu.to_channels_last_1d() to do conversation for both input tensor and model. See below: + +```python +import torch +import intel_extension_for_pytorch + +sycl_device = torch.device("xpu") + + +class Model(torch.nn.Module): + def __init__(self): + super(Model, self).__init__() + self.block = torch.nn.Sequential( + torch.nn.Conv1d(3, 3, kernel_size=3, stride=1, padding=1, bias=False), + torch.nn.BatchNorm1d(3) + ) + + def forward(self, x): + x = self.block(x) + return x + + +model = Model() +test_input = torch.rand([2, 3, 4]) +test_input_xpu = test_input.to(sycl_device) +test_input_xpu = torch.xpu.to_channels_last_1d(test_input_xpu) # Channels Last 1D conversation for tenor +model = model.to(sycl_device) +model = torch.xpu.to_channels_last_1d(model) # Channels Last 1D conversation for mode +xpu_res = model(test_input_xpu) + +print(torch.xpu.is_contiguous_channels_last_1d(xpu_res)) +``` + +### a. tensor conversion with Channels Last 1D + +```python +input_xpu = torch.xpu.to_channels_last_1d(input_xpu) +``` + +### b. model conversion with Channels Last 1D + +```python +model = torch.xpu.to_channels_last_1d(model) +``` + +### c. determine if in Channels Last 1D memory format + +```python +print(torch.xpu.is_contiguous_channels_last_1d(input)) +``` + +Note that because Meta doesn't support Channels Last 1D feature now: [RFC: A suggestion of channels last memory format implementation for 3D tensor](https://github.com/pytorch/pytorch/issues/74935), expect Channels Last 1D APIs above, other APIs from stock PyTorch may be invalid. E.g.: If you want to use memory format corrsponding API for Channels Last 1D, it cannot work as you wish. 
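+
+As a quick way to see the size/stride convention described in the PyTorch Strided Layout section of this page, the following sketch (an illustration, not taken from the original text) prints the strides of the same tensor in both memory formats; the sizes stay in NCHW order while only the strides change:
+
+```python
+import torch
+
+N, C, H, W = 2, 3, 4, 5
+x = torch.empty(N, C, H, W)                 # default NCHW (contiguous_format)
+print(x.stride())                           # (60, 20, 5, 1), i.e. (C*H*W, H*W, W, 1)
+
+y = x.to(memory_format=torch.channels_last) # NHWC physical layout
+print(y.shape)                              # torch.Size([2, 3, 4, 5]), still described in NCHW order
+print(y.stride())                           # (60, 1, 15, 3), i.e. (H*W*C, 1, W*C, C)
+print(y.is_contiguous(memory_format=torch.channels_last))  # True
+```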
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_kineto.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_kineto.md.txt new file mode 100644 index 000000000..d49ac4402 --- /dev/null +++ b/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_kineto.md.txt @@ -0,0 +1,155 @@ +Kineto Supported Profiler Tool (Prototype) +========================================== + +## Introduction + +The Kineto supported profiler tool is an extension of PyTorch\* profiler for profiling operators' executing time cost on GPU devices. With this tool, you can get information in many fields of the run models or code scripts. Build Intel® Extension for PyTorch\* with Kineto support as default and enable this tool using the `with` statement before the code segment. + +## Use Case + +To use the Kineto supported profiler tool, you need to build Intel® Extension for PyTorch\* from source or install it via prebuilt wheel. You also have various methods to disable this tool. + +### Build Tool + +The build flag `USE_PTI` is default ON for Intel® Extension for PyTorch\* to enable the PTI-based Kineto Profiler. Before building, you need to make sure the PTI-SDK is well preinstalled and sourced in your env. +Here is the command you can use to download the PTI-SDK onto your machine: `wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/5987ec30-be32-4dee-870f-7b97a1113488/l_intel-pti-dev_p_0.9.0.33_offline.sh`. After downloading the file, you need to install it by running `sh l_intel-pti-dev_p_0.9.0.33_offline.sh`. After you install the PTI, you need to run `source /pti/latest/env/vars.sh` to source the PTI, then you can start your building. + +### Use Tool + +#### Add Profiler Into Script + +All the usages are aligned with the official PyTorch\* suggested. Please refer to [PyTorch\*'s tutorial page](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) for the first step. 
+
+In your model script, use a `with` statement to enable the Kineto supported profiler tool ahead of your code snippets, as shown in the following example:
+
+```python
+# import all necessary libraries
+import torch
+from torch.profiler import profile, ProfilerActivity
+import intel_extension_for_pytorch
+
+# these lines won't be profiled before the profiler tool is enabled
+input_tensor = torch.randn(1024, dtype=torch.float32, device='xpu:0')
+
+# enable the Kineto supported profiler tool with a `with` statement
+with profile(activities=[ProfilerActivity.CPU,
+                         ProfilerActivity.XPU]) as prof:
+    # do what you want to profile here after the `with` statement with proper indent
+    output_tensor_1 = torch.nonzero(input_tensor)
+    output_tensor_2 = torch.unique(input_tensor)
+
+# print the result table formatted by the profiler tool as you wish
+print(prof.key_averages().table())
+```
+
+In your model script, you can also assign a schedule to profile several loops of iterations, as shown in the following example:
+
+```python
+from torch.profiler import schedule
+
+# assign a customized schedule
+my_schedule = schedule(
+    skip_first=10,
+    wait=1,
+    warmup=3,
+    active=1,
+    repeat=2)
+
+# also define a handler for outputting results
+def trace_handler(p):
+    print(p.key_averages().table(sort_by="self_xpu_time_total", row_limit=10))
+    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
+
+# pass the customized schedule and trace handler to the profiler outside the for-loop
+with profile(activities=[ProfilerActivity.CPU,
+                         ProfilerActivity.XPU],
+             schedule=my_schedule,
+             on_trace_ready=trace_handler) as prof:
+    for iter in range(len(dataloader)):
+        model(input)
+        # don't forget a step() at the end of each loop
+        prof.step()
+```
+
+There are a number of useful parameters defined in `torch.profiler.profile`. Many of them are aligned with the usages defined in PyTorch\*'s official profiler, such as `record_shapes`, a very useful parameter that controls whether to record the shape of input tensors for each operator. To enable the Kineto supported profiler on the XPU backend, remember to add `torch.profiler.ProfilerActivity.XPU` into the list of activities. For the usage of more parameters, please refer to [PyTorch\*'s API reference](https://pytorch.org/docs/stable/profiler.html#module-torch.profiler).
+
+#### Disable Tool in Model Script
+
+To disable this profiler tool in your model script, you must remove the profiler-related code, as PyTorch\* doesn't offer a switch in the `torch.profiler.profile` API. To reduce the effort of switching the profiler on and off, it is suggested to use `contextlib` for control, as shown below:
+
+```python
+import contextlib
+
+import torch
+from torch.profiler import ProfilerActivity
+
+def profiler_setup(profiling=False, *args, **kwargs):
+    if profiling:
+        return torch.profiler.profile(*args, **kwargs)
+    else:
+        return contextlib.nullcontext()
+
+# you can pass official arguments as normal
+with profiler_setup(profiling=should_profile,
+                    activities=[ProfilerActivity.XPU],
+                    schedule=my_schedule,
+                    on_trace_ready=trace_handler) as prof:
+    for iter in range(len(dataloader)):
+        model(input)
+
+        if should_profile:
+            prof.step()
+```
+
+#### Profile on Multi-device Application
+
+Follow the typical usage for profiling a multi-device application, and explicitly call `torch.xpu.synchronize(device_id)` for all involved devices. For example:
+
+```python
+# To run this example, make sure you have more than one device.
+assert torch.xpu.device_count() > 1, "This example needs more than one device."
+
+# put the first input on device "xpu:0"
+a_0 = torch.randn(100).to(torch.device("xpu:0"))
+# put the second input on device "xpu:1"
+a_1 = torch.randn(100).to(torch.device("xpu:1"))
+
+# start the profiler as normal
+with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.XPU]) as prof:
+    # run kernel on "xpu:0"
+    b_0 = a_0 + a_0
+    # run kernel on "xpu:1"
+    b_1 = a_1 + a_1
+    # explicitly synchronize all involved devices
+    torch.xpu.synchronize(torch.device("xpu:0"))
+    torch.xpu.synchronize(torch.device("xpu:1"))
+
+# You may check kernels on different devices in the chrome trace
+prof.export_chrome_trace("trace_example_on_multi_device.json")
+```
+
+### Result
+
+Using the first script shown above in the **Use Tool** part, you'll see the result table printed to the console as below:
+
+![Kineto_profiler_result_console](../../images/profiler_kineto/profiler_kineto_result_console.png)
+
+In this result, you can find several fields, including:
+
+- `Name`: the name of the run operators, runtime functions, or kernels.
+- `Self CPU %`, `Self CPU`: the time consumed by the operator itself on the host, excluding its children operator calls. The column marked with a percentage sign shows the proportion of time relative to the total self CPU time. When an operator is called more than once in a run, the self CPU time accumulates in this field.
+- `CPU total %`, `CPU total`: the time consumed by the operator on the host, including its children operator calls. The column marked with a percentage sign shows the proportion of time relative to the total CPU time. When an operator is called more than once in a run, the CPU time accumulates in this field.
+- `CPU time avg`: the average time consumed by each call of the operator on the host. This average is calculated from the CPU total time.
+- `Self XPU`, `Self XPU %`: similar to `Self CPU (%)` but shows the time consumption on XPU devices.
+- `XPU total`: similar to `CPU total` but shows the time consumption on XPU devices.
+- `XPU time avg`: similar to `CPU time avg` but shows the average time consumption on XPU devices. This average is calculated from the XPU total time.
+- `# of Calls`: the number of calls for each operator in a run.
+
+### Export to Chrome Trace
+
+You can export the result to a json file and then load it in the Chrome trace viewer (`chrome://tracing`) or the Perfetto viewer (`ui.perfetto.dev`) by adding this line in your model script:
+
+```python
+prof.export_chrome_trace("trace_file.json")
+```
+
+You can examine the sequence of profiled operators, runtime functions, and XPU kernels in these trace viewers. Here is a trace result for ResNet50 run on the XPU backend, viewed in the Perfetto viewer:
+
+![profiler_kineto_result_perfetto_viewer](../../images/profiler_kineto/profiler_kineto_result_perfetto_viewer.png)
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_legacy.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_legacy.md.txt
new file mode 100644
index 000000000..d60962619
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/tutorials/features/profiler_legacy.md.txt
@@ -0,0 +1,6 @@
+Legacy Profiler Tool (Deprecated)
+=================================
+
+## Introduction
+
+The legacy profiler tool is deprecated and will be removed from Intel® Extension for PyTorch* very soon. Please use the [Kineto Supported Profiler Tool](./profiler_kineto.md) instead for profiling operators' execution time on Intel® GPU devices.
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/simple_trace.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/simple_trace.md.txt
new file mode 100644
index 000000000..92671e646
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/tutorials/features/simple_trace.md.txt
@@ -0,0 +1,103 @@
+Simple Trace Tool (Deprecated)
+==============================
+
+## Introduction
+
+Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. You can enable this tool and have it automatically print out verbose messages of called operators in a stack format, with indenting to distinguish the context. You can enable and disable this tool with a simple method.
+
+## Use Case
+
+To use the simple trace tool, you need to build Intel® Extension for PyTorch\* from source and add explicit calls to enable and disable tracing in your model script. When enabled, the trace messages are printed to the console screen by default, along with verbose log messages.
+
+### Enable and Disable Tool
+
+The environment variable `IPEX_SIMPLE_TRACE` can be used to turn simple trace on or off. It is set to 0 by default. Set it to 1 to enable simple trace for all operators:
+
+```bash
+export IPEX_SIMPLE_TRACE=1
+```
+
+### Use Simple Trace in Model
+
+In your model script, bracket the code you want to trace with calls to `torch.xpu.enable_simple_trace()` and `torch.xpu.disable_simple_trace()`, as shown in the following example:
+
+```python
+# import all necessary libraries
+import torch
+import intel_extension_for_pytorch
+
+print(torch.xpu.using_simple_trace()) # False
+a = torch.randn(100).xpu() # this line won't be traced
+
+torch.xpu.enable_simple_trace() # to enable the simple trace tool
+
+# test code (with tracing enabled) begins here
+b = torch.randn(100).xpu()
+c = torch.unique(b)
+# test code ends here
+
+torch.xpu.disable_simple_trace() # to disable the simple trace tool
+```
+
+The simple trace output will start after being enabled, and will continue until
+the call to disable it, so be careful with your model script logic so the disable call is
+not unintentionally skipped.
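+
+If exceptions may be raised between the enable and disable calls, one way to guarantee that tracing is turned off is to wrap the two calls in a small context manager. The sketch below is only an illustration built on the `torch.xpu.enable_simple_trace()` / `torch.xpu.disable_simple_trace()` calls shown above; the helper name `simple_trace` is hypothetical and not part of the extension's API.
+
+```python
+import contextlib
+
+import torch
+import intel_extension_for_pytorch
+
+
+@contextlib.contextmanager
+def simple_trace():
+    # hypothetical helper: enable tracing on entry and always disable it on
+    # exit, even if the traced code raises an exception
+    torch.xpu.enable_simple_trace()
+    try:
+        yield
+    finally:
+        torch.xpu.disable_simple_trace()
+
+
+with simple_trace():
+    b = torch.randn(100).xpu()
+    c = torch.unique(b)
+```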
+
+### Results
+
+Using the script shown above as the example, you'll see these messages printed out to the console:
+
+```text
+[262618.262618] Call into OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#0)
+[262618.262618] Step out of OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#0)
+[262618.262618] Call into OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#1)
+[262618.262618] Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#1)
+[262618.262618] Call into OP: wrapper___unique2 -> at::AtenIpexTypeXPU::_unique2 (#2)
+[262618.262618] Call into OP: wrapper__clone -> at::AtenIpexTypeXPU::clone (#3)
+[262618.262618] Call into OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#4)
+[262618.262618] Step out of OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#4)
+[262618.262618] Call into OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#5)
+[262618.262618] Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#5)
+[262618.262618] Step out of OP: wrapper__clone -> at::AtenIpexTypeXPU::clone (#3)
+[262618.262618] Call into OP: wrapper___reshape_alias -> at::AtenIpexTypeXPU::_reshape_alias (#6)
+[262618.262618] Step out of OP: wrapper___reshape_alias -> at::AtenIpexTypeXPU::_reshape_alias (#6)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#7)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#7)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#8)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#8)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#9)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#9)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#10)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#10)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#11)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#11)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#12)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#12)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#13)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#13)
+[262618.262618] Call into OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#14)
+[262618.262618] Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#14)
+[262618.262618] Call into OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#15)
+[262618.262618] Step out of OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#15)
+[262618.262618] Call into OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#16)
+[262618.262618] Step out of OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#16)
+[262618.262618] Call into OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#17)
+[262618.262618] Step out of OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#17)
+[262618.262618] Call into OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#18)
+[262618.262618] Step out of OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#18)
+[262618.262618] Call into OP: wrapper__resize_ -> at::AtenIpexTypeXPU::resize_ (#19)
+[262618.262618] Step out of OP: wrapper__resize_ -> at::AtenIpexTypeXPU::resize_ (#19)
+[262618.262618] Step out of OP: wrapper___unique2 -> at::AtenIpexTypeXPU::_unique2 (#2)
+[262618.262618] Call into OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#20)
+[262618.262618] Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#20)
+```
+
+The meaning of each field is shown below:
+
+- `pid.tid`, `[262618.262618]`: the process id and the thread id responsible for the printed-out line.
+- `behavior`, `Call into OP`, `Step out of OP`: the call-in or step-out behavior of the operators in a run.
+- `name1 -> name2`, `wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided`: the calling operator for the current step. The name1 before the arrow shows the wrapper from PyTorch. The name2 after the arrow shows the function in Intel® Extension for PyTorch\* that was called into or stepped out of at the current step.
+- `(#No.)`, `(#0)`: the index of the called operators. Indices are numbered from 0 in the order in which the operators are called.
+- `indent`: the indentation ahead of each behavior shows the nesting relationship between operators. An operator call-in line with more indentation is a child of the operator called above it.
+
+With this output, you can see the call stack of the traced script without using complicated debugging tools such as gdb.
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt
new file mode 100644
index 000000000..cc2f2b785
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/tutorials/features/torch_compile_gpu.md.txt
@@ -0,0 +1,78 @@
+torch.compile for GPU (Beta)
+============================
+
+# Introduction
+
+Intel® Extension for PyTorch\* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html#torch-compile) API through the default "inductor" backend ([TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747/1)). The Triton compiler has been the core of the Inductor codegen, supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computationally intensive operations. Leveraging these features is as simple as using the default "inductor" backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.
+
+
+# Required Dependencies
+
+**Verified version**:
+- `torch` : v2.3
+- `intel_extension_for_pytorch` : v2.3
+- `triton` : >= v3.0.0
+
+
+Install [Intel® oneAPI Base Toolkit 2024.2.1](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).
+
+Follow [Intel® Extension for PyTorch\* Installation](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) to install `torch` and `intel_extension_for_pytorch` first.
+
+Triton can be installed directly using the following command:
+
+```bash
+pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu
+```
+
+Remember to activate the oneAPI Base Toolkit with the following commands.
+
+```bash
+# {dpcpproot} is the location of the dpcpp ROOT path, i.e. where you installed oneAPI DPC++, usually /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
+source {dpcpproot}/env/vars.sh
+```
+
+
+# Example Usage
+
+
+## Inference with torch.compile
+
+```python
+import torch
+import intel_extension_for_pytorch
+
+# create model
+model = SimpleNet().to("xpu")
+
+# compile model
+compiled_model = torch.compile(model, options={"freezing": True})
+
+# inference main
+input = torch.rand(64, 3, 224, 224, device=torch.device("xpu"))
+with torch.no_grad():
+    with torch.xpu.amp.autocast(dtype=torch.float16):
+        output = compiled_model(input)
+```
+
+## Training with torch.compile
+
+```python
+import torch
+import intel_extension_for_pytorch
+
+# create model and optimizer
+model = SimpleNet().to("xpu")
+optimizer = torch.optim.SGD(model.parameters(), lr=..., momentum=..., weight_decay=...)
+
+# compile model
+compiled_model = torch.compile(model)
+
+# training main
+input = torch.rand(64, 3, 224, 224, device=torch.device("xpu"))
+with torch.xpu.amp.autocast(dtype=torch.bfloat16):
+    output = compiled_model(input)
+    loss = loss_function(output)
+optimizer.zero_grad()
+loss.backward()
+optimizer.step()
+```
diff --git a/xpu/2.3.110+xpu/_sources/tutorials/getting_started.md.txt b/xpu/2.3.110+xpu/_sources/tutorials/getting_started.md.txt
new file mode 100644
index 000000000..d96d2fe7c
--- /dev/null
+++ b/xpu/2.3.110+xpu/_sources/tutorials/getting_started.md.txt
@@ -0,0 +1,62 @@
+# Quick Start
+
+The following instructions assume you have installed the Intel® Extension for PyTorch\*. For installation instructions, refer to [Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.30%2bxpu).
+
+To start using the Intel® Extension for PyTorch\* in your code, you need to make the following changes:
+
+1. Import the extension with `import intel_extension_for_pytorch as ipex`.
+2. Move the model and data to GPU with `to('xpu')`, if you want to run on GPU.
+3. Invoke the `optimize()` function to apply optimizations.
+4. For TorchScript, invoke `torch.jit.trace()` and `torch.jit.freeze()`.
+
+**Important:** It is highly recommended to `import intel_extension_for_pytorch` right after `import torch`, prior to importing other packages.
+
+The example below demonstrates how to use the Intel® Extension for PyTorch\*:
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex
+
+model = Model()
+model.eval() # Set the model to evaluation mode for inference, as required by the ipex.optimize() function.
+data = ...
+
+dtype = torch.float32 # torch.bfloat16, torch.float16 (float16 only works on GPU)
+
+##### Run on GPU ######
+model = model.to('xpu')
+data = data.to('xpu')
+#######################
+
+model = ipex.optimize(model, dtype=dtype)
+
+########## FP32 ############
+with torch.no_grad():
+####### BF16 on CPU ########
+with torch.no_grad(), torch.cpu.amp.autocast():
+##### BF16/FP16 on GPU #####
+with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=dtype, cache_enabled=False):
+############################
+  ###### TorchScript #######
+  model = torch.jit.trace(model, data)
+  model = torch.jit.freeze(model)
+  ###### TorchScript #######
+
+  model(data)
+```
+
+More examples, including training and the usage of low-precision data types, are available at [Examples](./examples.md).
+
+
+## Execution
+
+Some environment variables can be used at runtime to configure execution on GPU. Please check [Advanced Configuration](./features/advanced_configuration.html#runtime-configuration) for more detailed information.
+
+Set `OCL_ICD_VENDORS` to the default path `/etc/OpenCL/vendors`.
+Set `CCL_ROOT` if you are using multiple GPUs.
+
+```bash
+export OCL_ICD_VENDORS=/etc/OpenCL/vendors
+export CCL_ROOT=${CONDA_PREFIX}
+python
+
+
+
+
+
+ + +
+ +
+
+
+
    +
+
+
+
+
+ +
+

Intel® Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc

+

This document explains the dynamic kernel dispatch mechanism for Intel® Extension for PyTorch* (IPEX) based on CPU ISA. It is an extension to the similar mechanism in PyTorch.

+
+

Overview

+

IPEX dyndisp is forked from PyTorch: ATen/native/DispatchStub.h and ATen/native/DispatchStub.cpp. IPEX adds additional CPU ISA level support, such as AVX512_VNNI, AVX512_BF16 and AMX.

+

PyTorch & IPEX CPU ISA support statement:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DEFAULTAVX2AVX2_VNNIAVX512AVX512_VNNIAVX512_BF16AMXAVX512_FP16
PyTorch
IPEX-1.11
IPEX-1.12
IPEX-1.13
IPEX-2.1
IPEX-2.2

* The current IPEX DEFAULT level is implemented the same as the AVX2 level.

+
+

CPU ISA build compiler requirement

+

| ISA Level | GCC requirement |
+| ---- | ---- |
+| AVX2 | Any |
+| AVX512 | GCC 9.2+ |
+| AVX512_VNNI | GCC 9.2+ |
+| AVX512_BF16 | GCC 10.3+ |
+| AVX2_VNNI | GCC 11.2+ |
+| AMX | GCC 11.2+ |
+| AVX512_FP16 | GCC 12.1+ |

+

* Check with cmake/Modules/FindAVX.cmake for detailed compiler checks.

+
+
+
+

Dynamic Dispatch Design

+

Dynamic dispatch copies the kernel implementation source files to multiple folders for each ISA level. It then builds each file using its ISA specific parameters. Each generated object file will contain its function body (Kernel Implementation).

+

Kernel Implementation uses an anonymous namespace so that different CPU versions won’t conflict.

+

Kernel Stub is a “virtual function” with polymorphic kernel implementations pertaining to ISA levels.

+

At runtime, the Dispatch Stub implementation checks the CPUID and OS status to determine the best-matching ISA level and dispatches to the corresponding function body pointer.

+
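As a quick sanity check of which ISA level the dispatcher resolved to at runtime, you can query the private debug APIs described later in this document. The minimal sketch below only uses the APIs shown in the Private Debug APIs section; they are internal, intended for debugging, and subject to change.
+
+import intel_extension_for_pytorch._C as core
+
+# The effective level is the smaller of the max CPU-supported level
+# and the max binary-supported level.
+print(core._get_current_isa_level())
+print(core._get_highest_cpu_support_isa_level())
+print(core._get_highest_binary_support_isa_level())
+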
+

Code Folder Struct

+
+

Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp

+
+
+

Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h

+
+
+

Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h

+
+
+
+

CodeGen Process

+

The IPEX build system generates code for each ISA level with specific compiler parameters. The CodeGen script is located at cmake/cpu/IsaCodegen.cmake.

+

The CodeGen copies each cpp file from the kernel implementation folder and then adds the ISA level as a new file suffix.

+
+

Sample:

+
+

Original file:

+

csrc/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp

+

Generated files:

+

DEFAULT: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.DEFAULT.cpp -O3 -D__AVX__ -DCPU_CAPABILITY_AVX2 -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=DEFAULT -DCPU_CAPABILITY_DEFAULT

+

AVX2: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX2.cpp -O3 -D__AVX__ -mavx2 -mfma -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -DCPU_CAPABILITY=AVX2 -DCPU_CAPABILITY_AVX2

+

AVX512: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512.cpp -O3 -D__AVX512F__ -mavx512f -mavx512bw -mavx512vl -mavx512dq -mfma -DCPU_CAPABILITY=AVX512 -DCPU_CAPABILITY_AVX512

+

AVX512_VNNI: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_VNNI.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mfma -DCPU_CAPABILITY=AVX512_VNNI -DCPU_CAPABILITY_AVX512_VNNI

+

AVX512_BF16: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_BF16.cpp -O3 -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -DCPU_CAPABILITY=AVX512_BF16 -DCPU_CAPABILITY_AVX512_BF16

+

AMX: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AMX.cpp -O3  -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -DCPU_CAPABILITY=AMX -DCPU_CAPABILITY_AMX

+

AVX512_FP16: build/Release/csrc/isa_codegen/cpu/aten/kernels/AdaptiveAveragePoolingKrnl.cpp.AVX512_FP16.cpp -O3  -D__AVX512F__ -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI -DCPU_CAPABILITY_AVX512_BF16 -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512vnni -mavx512bf16 -mfma -mamx-tile -mamx-int8 -mamx-bf16 -mavx512fp16 -DCPU_CAPABILITY_AMX -DCPU_CAPABILITY=AVX512_FP16 -DCPU_CAPABILITY_AVX512_FP16

+
+
+
+

Note:

+
    +
  1. DEFAULT level kernels are not fully implemented in IPEX. In order to align with PyTorch, the DEFAULT level is built with AVX2 parameters instead. So, the minimal requirement for a machine to execute IPEX is AVX2 support.

  2. +
  3. -D__AVX__ and -D__AVX512F__ are defined for the dependency library sleef.

  4. +
  5. -DCPU_CAPABILITY_AVX512 and -DCPU_CAPABILITY_AVX2 must be defined for PyTorch: aten/src/ATen/cpu/vec; they determine the vec register width.

  6. +
  7. -DCPU_CAPABILITY=[ISA_NAME] must be defined for PyTorch: aten/src/ATen/cpu/vec; it is used as the inline namespace name.

  8. +
  9. A higher ISA level is compatible with lower ISA levels, so it needs to contain the lower levels' ISA feature definitions. For example, AVX512_BF16 needs to contain -DCPU_CAPABILITY_AVX512 -DCPU_CAPABILITY_AVX512_VNNI. But AVX512 doesn't contain the AVX2 definitions, because they have different vec register widths.

  10. +
+
+
+
+
+

Add Custom Kernel

+

If you want to add a new custom kernel, and the kernel uses CPU ISA instructions, refer to these tips:

+
    +
  1. Add CPU ISA related kernel implementation to the folder: csrc/cpu/aten/kernels/NewKernelKrnl.cpp

  2. +
  3. Add kernel stub to the folder: csrc/cpu/aten/NewKernel.cpp

  4. +
  5. Include the header file: csrc/cpu/dyndisp/DispatchStub.h, and refer to the comment in the header file.

  6. +
+
// Implements instruction set specific function dispatch.
+//
+// Kernels that may make use of specialized instruction sets (e.g. AVX2) are
+// compiled multiple times with different compiler flags (e.g. -mavx2). A
+// DispatchStub contains a table of function pointers for a kernel. At runtime,
+// the fastest available kernel is chosen based on the features reported by
+// cpuinfo.
+//
+// Example:
+//
+// In csrc/cpu/aten/MyKernel.h:
+//   using fn_type = void(*)(const Tensor& x);
+//   IPEX_DECLARE_DISPATCH(fn_type, stub);
+//
+// In csrc/cpu/aten/MyKernel.cpp
+//   IPEX_DEFINE_DISPATCH(stub);
+//
+// In csrc/cpu/aten/kernels/MyKernel.cpp:
+//   namespace {
+//     // use anonymous namespace so that different cpu versions won't conflict
+//     void kernel(const Tensor& x) { ... }
+//   }
+//   IPEX_REGISTER_DISPATCH(stub, &kernel);
+//
+// To call:
+//   stub(kCPU, tensor);
+
+
+
    +
  1. Write the kernel following the guide. This includes declaring the function type, registering the stub, calling the stub, etc.

  2. +
+
+

Note:

+
    +
  1. Some kernels only call the oneDNN or iDeep implementation, or another backend implementation, so they do not need ISA-specific kernel implementations. (Refer: BatchNorm.cpp)

  2. +
  3. Vec-related header files must be included in kernel implementation files, but cannot be included in the kernel stub. The kernel stub is common code for all ISA levels and is not built with ISA-related compiler parameters.

  4. +
  5. For more intrinsics, check the Intel® Intrinsics Guide.

  6. +
+
+
+

ISA intrinsics specific kernel example:

+

This is an FP32-to-BF16 conversion function example, implemented for the AVX512_BF16, AVX512, and DEFAULT ISA levels.

+
//csrc/cpu/aten/CvtFp32ToBf16.h
+
+#pragma once
+
+#include <dyndisp/DispatchStub.h>
+
+namespace torch_ipex {
+namespace cpu {
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len);
+
+namespace {
+
+void cvt_fp32_to_bf16_kernel_impl(at::BFloat16* dst, const float* src, int len);
+
+}
+
+using cvt_fp32_to_bf16_kernel_fn = void (*)(at::BFloat16*, const float*, int);
+IPEX_DECLARE_DISPATCH(cvt_fp32_to_bf16_kernel_fn, cvt_fp32_to_bf16_kernel_stub);
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/CvtFp32ToBf16.cpp
+
+#include "CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(cvt_fp32_to_bf16_kernel_stub);
+
+void cvt_fp32_to_bf16(at::BFloat16* dst, const float* src, int len) {
+  return cvt_fp32_to_bf16_kernel_stub(kCPU, dst, src, len);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+

The macros CPU_CAPABILITY_AVX512 and CPU_CAPABILITY_AVX512_BF16 are defined by compiler checks, which means the current compiler is capable of generating code for the corresponding ISA level.

+

Because AVX512_BF16 is a higher level than AVX512 and is compatible with it, the CPU_CAPABILITY_AVX512_BF16 branch can be contained within the CPU_CAPABILITY_AVX512 region.

+
//csrc/cpu/aten/kernels/CvtFp32ToBf16Krnl.cpp
+
+#include <ATen/cpu/vec/vec.h>
+#include "csrc/aten/cpu/CvtFp32ToBf16.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <ATen/cpu/vec/vec512/vec512.h>
+#else
+#include <ATen/cpu/vec/vec256/vec256.h>
+#endif
+using namespace at::vec;
+
+#if defined(CPU_CAPABILITY_AVX512)
+#include <immintrin.h>
+
+inline __m256i _cvt_fp32_to_bf16(const __m512 src) {
+#if (defined CPU_CAPABILITY_AVX512_BF16) // AVX512_BF16 ISA implementation.
+  return reinterpret_cast<__m256i>(_mm512_cvtneps_pbh(src));
+#else  // AVX512 ISA implementation.
+  __m512i value = _mm512_castps_si512(src);
+  __m512i nan = _mm512_set1_epi32(0xffff);
+  auto mask_value = _mm512_cmp_ps_mask(src, src, _CMP_ORD_Q);
+  __m512i ones = _mm512_set1_epi32(0x1);
+  __m512i vec_bias = _mm512_set1_epi32(0x7fff);
+  // uint32_t lsb = (input >> 16) & 1;
+  auto t_value = _mm512_and_si512(_mm512_srli_epi32(value, 16), ones);
+  // uint32_t rounding_bias = 0x7fff + lsb;
+  t_value = _mm512_add_epi32(t_value, vec_bias);
+  // input += rounding_bias;
+  t_value = _mm512_add_epi32(t_value, value);
+  // input = input >> 16;
+  t_value = _mm512_srli_epi32(t_value, 16);
+  // Check NaN before converting back to bf16
+  t_value = _mm512_mask_blend_epi32(mask_value, nan, t_value);
+  return _mm512_cvtusepi32_epi16(t_value);
+#endif
+}
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  int i = 0;
+  for (; i < len - 15; i += 16) {
+    auto f32 = _mm512_loadu_ps(src + i);
+    _mm256_storeu_si256((__m256i*)(dst + i), _cvt_fp32_to_bf16(f32));
+  }
+  if (i < len) {
+    auto mask = (1 << (len - i)) - 1;
+    auto f32 = _mm512_maskz_loadu_ps(mask, src + i);
+    _mm256_mask_storeu_epi16(dst + i, mask, _cvt_fp32_to_bf16(f32));
+  }
+}
+
+#else // DEFAULT ISA implementation.
+
+void cvt_fp32_to_bf16_kernel_impl(
+    at::BFloat16* dst,
+    const float* src,
+    int len) {
+  for (int j = 0; j < len; j++) {
+    *(dst + j) = *(src + j);
+  }
+}
+
+#endif
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(cvt_fp32_to_bf16_kernel_stub, &cvt_fp32_to_bf16_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
+
+

Vec specific kernel example:

+

This example shows how to get the data type size and its Vec size. On different ISAs, Vec has a different register width and therefore a different Vec size.

+
//csrc/cpu/aten/GetVecLength.h
+#pragma once
+
+#include <dyndisp/DispatchStub.h>
+
+namespace torch_ipex {
+namespace cpu {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype);
+
+namespace {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype);
+}
+
+using get_cpp_typesize_and_vecsize_kernel_fn =
+    std::tuple<int, int> (*)(at::ScalarType);
+IPEX_DECLARE_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_fn,
+    get_cpp_typesize_and_vecsize_kernel_stub);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/GetVecLength.cpp
+
+#include "GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+IPEX_DEFINE_DISPATCH(get_cpp_typesize_and_vecsize_kernel_stub);
+
+// get cpp typesize and vectorsize by at::ScalarType
+std::tuple<int, int> get_cpp_typesize_and_vecsize(at::ScalarType dtype) {
+  return get_cpp_typesize_and_vecsize_kernel_stub(kCPU, dtype);
+}
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
//csrc/cpu/aten/kernels/GetVecLengthKrnl.cpp
+
+#include <ATen/cpu/vec/vec.h>
+#include "csrc/cpu/aten/GetVecLength.h"
+
+namespace torch_ipex {
+namespace cpu {
+
+namespace {
+
+std::tuple<int, int> get_cpp_typesize_and_vecsize_kernel_impl(
+    at::ScalarType dtype) {
+  switch (dtype) {
+    case at::ScalarType::Double:
+      return std::make_tuple(
+          sizeof(double), at::vec::Vectorized<double>::size());
+    case at::ScalarType::Float:
+      return std::make_tuple(sizeof(float), at::vec::Vectorized<float>::size());
+    case at::ScalarType::ComplexDouble:
+      return std::make_tuple(
+          sizeof(c10::complex<double>),
+          at::vec::Vectorized<c10::complex<double>>::size());
+    case at::ScalarType::ComplexFloat:
+      return std::make_tuple(
+          sizeof(c10::complex<float>),
+          at::vec::Vectorized<c10::complex<float>>::size());
+    case at::ScalarType::BFloat16:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::BFloat16>::t)),
+          at::vec::Vectorized<decltype(c10::impl::ScalarTypeToCPPType<
+                                       at::ScalarType::BFloat16>::t)>::size());
+    case at::ScalarType::Half:
+      return std::make_tuple(
+          sizeof(decltype(
+              c10::impl::ScalarTypeToCPPType<at::ScalarType::Half>::t)),
+          at::vec::Vectorized<decltype(c10::impl::ScalarTypeToCPPType<
+                                       at::ScalarType::Half>::t)>::size());
+    default:
+      TORCH_CHECK(
+          false,
+          "Currently only floating and complex ScalarType are supported.");
+  }
+}
+
+} // anonymous namespace
+
+IPEX_REGISTER_DISPATCH(
+    get_cpp_typesize_and_vecsize_kernel_stub,
+    &get_cpp_typesize_and_vecsize_kernel_impl);
+
+} // namespace cpu
+} // namespace torch_ipex
+
+
+
+
+
+

Private Debug APIs

+

Here are three ISA-related private APIs that can help with debugging:

+
    +
  1. Query current ISA level.

  2. +
  3. Query max CPU supported ISA level.

  4. +
  5. Query max binary supported ISA level.

  6. +
+
+

Note:

+
    +
  1. The max CPU supported ISA level depends only on CPU features.

  2. +
  3. The max binary supported ISA level depends only on the compiler version used for the build.

  4. +
  5. The current ISA level is the smaller of the max CPU ISA level and the max binary ISA level.

  6. +
+
+
+

Example:

+
python
+Python 3.9.7 (default, Sep 16 2021, 13:09:58)
+[GCC 7.5.0] :: Anaconda, Inc. on linux
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import intel_extension_for_pytorch._C as core
+>>> core._get_current_isa_level()
+'AMX'
+>>> core._get_highest_cpu_support_isa_level()
+'AMX'
+>>> core._get_highest_binary_support_isa_level()
+'AMX'
+>>> quit()
+
+
+
+
+
+

Select ISA level manually.

+

By default, IPEX dispatches to the kernels with the maximum ISA level supported by the underlying CPU hardware. This ISA level can be overridden by the environment variable ATEN_CPU_CAPABILITY (same environment variable as PyTorch). The available values are {avx2, avx512, avx512_vnni, avx512_bf16, amx, avx512_fp16}. The effective ISA level would be the minimal level between ATEN_CPU_CAPABILITY and the maximum level supported by the hardware.

+
+

Example:

+
$ python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AMX
+$ ATEN_CPU_CAPABILITY=avx2 python -c 'import intel_extension_for_pytorch._C as core;print(core._get_current_isa_level())'
+AVX2
+
+
+
+

Note:

+

core._get_current_isa_level() is an IPEX internal function used for checking the current effective ISA level. It is used for debugging purposes only and is subject to change.

+
+
+
+
+

CPU feature check

+

An additional CPU feature check tool is available in the subfolder: tests/cpu/isa

+
$ cmake .
+-- The C compiler identification is GNU 11.2.1
+-- The CXX compiler identification is GNU 11.2.1
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+-- Configuring done
+-- Generating done
+-- Build files have been written to: tests/cpu/isa
+$ make
+[ 33%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature.cpp.o
+[ 66%] Building CXX object CMakeFiles/cpu_features.dir/intel_extension_for_pytorch/csrc/cpu/isa/cpu_feature_main.cpp.o
+[100%] Linking CXX executable cpu_features
+[100%] Built target cpu_features
+$ ./cpu_features
+XCR0: 00000000000602e7
+os --> avx: true
+os --> avx2: true
+os --> avx512: true
+os --> amx: true
+mmx:                    true
+sse:                    true
+sse2:                   true
+sse3:                   true
+ssse3:                  true
+sse4_1:                 true
+sse4_2:                 true
+aes_ni:                 true
+sha:                    true
+xsave:                  true
+fma:                    true
+f16c:                   true
+avx:                    true
+avx2:                   true
+avx_vnni:                       true
+avx512_f:                       true
+avx512_cd:                      true
+avx512_pf:                      false
+avx512_er:                      false
+avx512_vl:                      true
+avx512_bw:                      true
+avx512_dq:                      true
+avx512_ifma:                    true
+avx512_vbmi:                    true
+avx512_vpopcntdq:                       true
+avx512_4fmaps:                  false
+avx512_4vnniw:                  false
+avx512_vbmi2:                   true
+avx512_vpclmul:                 true
+avx512_vnni:                    true
+avx512_bitalg:                  true
+avx512_fp16:                    true
+avx512_bf16:                    true
+avx512_vp2intersect:                    true
+amx_bf16:                       true
+amx_tile:                       true
+amx_int8:                       true
+prefetchw:                      true
+prefetchwt1:                    false
+
+
+
+
+ + +
+
+
+ +
+ +
+


© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/genindex.html b/xpu/2.3.110+xpu/genindex.html new file mode 100644 index 000000000..68460042b --- /dev/null +++ b/xpu/2.3.110+xpu/genindex.html @@ -0,0 +1,250 @@ + + + + + + Index — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + +

Index

+ +
+ E + | F + | G + | M + | O + | R + | S + | T + +
+

E

+ + +
+ +

F

+ + +
+ +

G

+ + +
+ +

M

+ + + +
+ +

O

+ + +
+ +

R

+ + + +
+ +

S

+ + +
+ +

T

+ + + +
+ + + +
+
+
+ +
+ +
+


© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/index.html b/xpu/2.3.110+xpu/index.html new file mode 100644 index 000000000..b5bfeb220 --- /dev/null +++ b/xpu/2.3.110+xpu/index.html @@ -0,0 +1,196 @@ + + + + + + + + + Intel® Extension for PyTorch* — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Intel® Extension for PyTorch*

+

Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware.
+Optimizations take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs.
+Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.

+

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Starting from 2.1.0, specific optimizations for certain +Large Language Models (LLMs) are introduced in the Intel® Extension for PyTorch*. For more information on LLM optimizations, refer to the Large Language Models (LLMs) section.

+

The extension can be loaded as a Python module for Python programs or linked as a C++ library for C++ programs. In Python scripts, users can enable it dynamically by importing intel_extension_for_pytorch.

+
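For example, a minimal sketch of enabling the extension in a Python script and running a small computation on an Intel GPU (assuming a GPU package is installed and an xpu device is available):
+
+import torch
+import intel_extension_for_pytorch as ipex  # registers the xpu device and the extension APIs
+
+# run a small tensor operation on the Intel GPU
+x = torch.randn(8, 8).to("xpu")
+y = (x @ x).cpu()
+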
+

Note

+
    +
  • CPU features are not included in GPU-only packages.

  • +
  • GPU features are not included in CPU-only packages.

  • +
  • Optimizations for CPU-only may have a newer code base due to different development schedules.

  • +
+
+

Intel® Extension for PyTorch* has been released as an open-source project on GitHub. You can find the source code and instructions on how to get started at:

+ +

You can find more information about the product at:

+ +
+

Architecture

+

Intel® Extension for PyTorch* is structured as shown in the following figure:

+
+Architecture of Intel® Extension for PyTorch* +
+

Architecture of Intel® Extension for PyTorch*

+
+
+
    +
  • Eager Mode: In the eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Further performance improvement is achieved by converting eager-mode models into graph mode using extended graph fusion passes.

  • +
  • Graph Mode: In the graph mode, fusions reduce operator/kernel invocation overhead, resulting in improved performance. Compared to the eager mode, the graph mode in PyTorch* normally yields better performance from optimization techniques like operation fusion. Intel® Extension for PyTorch* amplifies them with more comprehensive graph optimizations. Both PyTorch TorchScript and TorchDynamo graph modes are supported. With TorchScript, we recommend using torch.jit.trace() as your preferred option, as it generally supports a wider range of workloads compared to torch.jit.script(). With TorchDynamo, the ipex backend is available to provide good performance (see the sketch after this list).

  • +
  • CPU Optimization: On CPU, Intel® Extension for PyTorch* automatically dispatches operators to underlying kernels based on detected instruction set architecture (ISA). The extension leverages vectorization and matrix acceleration units available on Intel hardware. The runtime extension offers finer-grained thread runtime control and weight sharing for increased efficiency.

  • +
  • GPU Optimization: On GPU, optimized operators and kernels are implemented and registered through the PyTorch dispatching mechanism. These operators and kernels are accelerated by the native vectorization and matrix calculation features of Intel GPU hardware. Intel® Extension for PyTorch* for GPU utilizes the DPC++ compiler that supports the latest SYCL* standard and also a number of extensions to the SYCL* standard, which can be found in the sycl/doc/extensions directory.

  • +
+
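The following minimal sketch illustrates the two graph-mode paths mentioned in the list above for an inference model on the xpu device; SimpleNet and the input shape are placeholders, not part of the extension.
+
+import torch
+import intel_extension_for_pytorch as ipex
+
+model = SimpleNet().to("xpu").eval()  # SimpleNet is a placeholder model
+data = torch.randn(1, 3, 224, 224, device="xpu")
+model = ipex.optimize(model)
+
+with torch.no_grad():
+    # TorchScript path: torch.jit.trace() is generally preferred over torch.jit.script()
+    traced = torch.jit.freeze(torch.jit.trace(model, data))
+    traced(data)
+
+    # TorchDynamo path: the default "inductor" backend is used here
+    compiled = torch.compile(model)
+    compiled(data)
+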
+
+

Support

+

The team tracks bugs and enhancement requests using GitHub issues. Before submitting a suggestion or bug report, search the existing GitHub issues to see if your issue has already been reported.

+
+
+
+
+
+
+
+
+
+
+ + +
+
+
+ +
+ +
+


© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/objects.inv b/xpu/2.3.110+xpu/objects.inv new file mode 100644 index 000000000..331b11fe4 Binary files /dev/null and b/xpu/2.3.110+xpu/objects.inv differ diff --git a/xpu/2.3.110+xpu/search.html b/xpu/2.3.110+xpu/search.html new file mode 100644 index 000000000..03b31b020 --- /dev/null +++ b/xpu/2.3.110+xpu/search.html @@ -0,0 +1,162 @@ + + + + + + Search — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+
    +
  • + +
  • +
  • +
+
+
+
+
+ + + + +
+ +
+ +
+
+
+ +
+ +
+


© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + + + + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/searchindex.js b/xpu/2.3.110+xpu/searchindex.js new file mode 100644 index 000000000..8c5133fef --- /dev/null +++ b/xpu/2.3.110+xpu/searchindex.js @@ -0,0 +1 @@ +Search.setIndex({"alltitles": {"1.10.200+gpu": [[32, "gpu"]], "1.13.10+xpu": [[32, "id19"]], "1.13.120+xpu": [[32, "id16"]], "2.0.110+xpu": [[32, "id13"]], "2.1.10+xpu": [[32, "id10"]], "2.1.20+xpu": [[32, "id7"]], "2.1.30+xpu": [[32, "id4"]], "2.1.40+xpu": [[32, "id1"]], "2.3.110+xpu": [[32, "xpu"]], "API Documentation": [[2, null], [25, "api-documentation"]], "Add Custom Kernel": [[0, "add-custom-kernel"]], "Add Profiler Into Script": [[20, "add-profiler-into-script"]], "Advanced Configuration": [[5, "advanced-configuration"], [10, null]], "Ahead of Time (AOT) Compilation": [[34, null]], "Ahead of Time Compilation (AOT) [GPU]": [[33, "ahead-of-time-compilation-aot-gpu"]], "Architecture": [[1, "architecture"]], "Asynchronous Programming": [[7, "asynchronous-programming"]], "Auto Channels Last": [[12, null]], "Auto Mixed Precision (AMP)": [[5, "auto-mixed-precision-amp"]], "Auto Mixed Precision (AMP) on GPU": [[11, null]], "Autocast Op Reference": [[11, "autocast-op-reference"]], "Automatic Channels Last": [[35, "automatic-channels-last"]], "BERT": [[4, "bert"], [4, "id3"], [4, "id7"], [4, "id10"], [4, "id13"], [4, "id16"]], "BFloat16": [[4, "bfloat16"], [4, "id4"]], "Basic Usage": [[4, "basic-usage"]], "Better local unit tests with pytest": [[3, "better-local-unit-tests-with-pytest"]], "Build Time Configuration": [[10, "build-time-configuration"]], "Build Tool": [[20, "build-tool"]], "Building documentation": [[3, "building-documentation"]], "Building with CMake": [[8, "building-with-cmake"]], "Building with setuptools": [[8, "building-with-setuptools"]], "C++": [[4, "c"]], "C++ API": [[2, "c-api"]], "CPU ISA build compiler requirement": [[0, "cpu-isa-build-compiler-requirement"]], "CPU feature check": [[0, "cpu-feature-check"]], "Channels Last": [[5, "channels-last"], [19, null]], "Channels Last 1D support on XPU": [[19, "channels-last-1d-support-on-xpu"]], "Channels Last Memory Format APIs": [[19, "channels-last-memory-format-apis"]], "Code Folder Struct": [[0, "code-folder-struct"]], "CodeGen Process": [[0, "codegen-process"]], "Compute Engine (Experimental feature for debug)": [[13, null]], "Compute Engine (Prototype feature for debug)": [[5, "compute-engine-prototype-feature-for-debug"]], "Configuration": [[31, "configuration"]], "Contributing to Intel\u00ae Extension for PyTorch*": [[3, "contributing-to-intel-extension-for-pytorch"]], "Contribution": [[3, null]], "Customize DPC++ kernels": [[4, "customize-dpc-kernels"]], "DDP Usage": [[6, "ddp-usage"]], "DDP scaling API (GPU Only)": [[6, "ddp-scaling-api-gpu-only"]], "DLDevice and data pointer": [[7, "dldevice-and-data-pointer"]], "DLPack Solution": [[5, "dlpack-solution"], [7, null]], "DPC++ Extension": [[5, "dpc-extension"], [8, null]], "Deep Fusion Policy": [[28, "deep-fusion-policy"]], "Default Precision": [[11, "default-precision"]], "Design": [[7, "design"]], "Developing Intel\u00ae Extension for PyTorch* on XPU": [[3, "developing-intel-extension-for-pytorch-on-xpu"]], "Disable Tool in Model Script": [[20, "disable-tool-in-model-script"]], "Dispatch Stub implementation: csrc/cpu/dyndisp/DispatchStub.cpp and csrc/cpu/dyndisp/DispatchStub.h": [[0, "dispatch-stub-implementation-csrc-cpu-dyndisp-dispatchstub-cpp-and-csrc-cpu-dyndisp-dispatchstub-h"]], "Distributed Inference": [[28, 
"distributed-inference"]], "Distributed Inference with DeepSpeed": [[30, "distributed-inference-with-deepspeed"]], "Distributed Training": [[5, "distributed-training"]], "DistributedDataParallel (DDP)": [[6, null]], "Dynamic Dispatch Design": [[0, "dynamic-dispatch-design"]], "Ease-of-use auto channels last API": [[12, "ease-of-use-auto-channels-last-api"]], "Easy-to-use Python API": [[5, "easy-to-use-python-api"]], "Enable and Disable Tool": [[22, "enable-and-disable-tool"]], "Engine Selection Policy": [[13, "engine-selection-policy"]], "Enviornment settings": [[18, "enviornment-settings"]], "Environment Setup": [[29, "environment-setup"]], "Event Log": [[18, "event-log"]], "Example": [[9, "example"]], "Example Case": [[7, "example-case"]], "Example Usage": [[23, "example-usage"]], "Example Usage (MPI launch for single node):": [[6, "example-usage-mpi-launch-for-single-node"]], "Example:": [[0, "example"], [0, "id1"]], "Examples": [[4, null]], "Execute WOQ benchmark script": [[29, "execute-woq-benchmark-script"]], "Execution": [[24, "execution"]], "Export DLPack Capsule": [[7, "export-dlpack-capsule"]], "Export to Chrome Trace": [[20, "export-to-chrome-trace"]], "FP16": [[30, "fp16"]], "FP8 Quantization": [[15, "fp8-quantization"]], "FP8 usage example": [[15, "fp8-usage-example"]], "FSDP Usage (GPU only)": [[9, "fsdp-usage-gpu-only"]], "Features": [[5, null]], "Fetching the corresponding sycl::queue": [[8, "fetching-the-corresponding-sycl-queue"]], "Float16": [[4, "float16"]], "Float32": [[4, "float32"], [4, "id1"]], "Float8 Data Type": [[15, "float8-data-type"]], "Float8 Data Type Support (Prototype)": [[15, null]], "Fully Sharded Data Parallel (FSDP)": [[5, "fully-sharded-data-parallel-fsdp"], [9, null]], "General": [[2, "general"]], "General Usage": [[26, "general-usage"]], "Get Started": [[25, "get-started"]], "Hardware Configuration": [[31, "hardware-configuration"]], "Highlights": [[32, "highlights"], [32, "id2"], [32, "id5"], [32, "id8"], [32, "id11"], [32, "id14"], [32, "id17"], [32, "id20"], [32, "id22"]], "Horovod with PyTorch (Prototype)": [[16, null]], "Horovod with PyTorch Usage": [[16, "horovod-with-pytorch-usage"]], "INT8": [[4, "int8"]], "IPEX_LOG (Prototype feature for debug)": [[5, "ipex-log-prototype-feature-for-debug"]], "IPEX_LOG (Prototype)": [[18, null]], "IPEX_LOG Definition": [[18, "ipex-log-definition"]], "ISA intrinics specific kernel example:": [[0, "isa-intrinics-specific-kernel-example"]], "Imperative Mode": [[4, "imperative-mode"], [4, "id5"], [4, "id11"], [17, "imperative-mode"]], "Imperative mode": [[30, "imperative-mode"]], "Import DLPack Capsule": [[7, "import-dlpack-capsule"]], "Inference": [[4, "inference"]], "Inference with Imperative Path": [[11, "inference-with-imperative-path"]], "Inference with TorchScript Path": [[11, "inference-with-torchscript-path"]], "Inferenece with torch.compile": [[23, "inferenece-with-torch-compile"]], "Install Horovod with PyTorch": [[16, "install-horovod-with-pytorch"]], "Install Intel-extension-for-transformers and Neural-compressor": [[29, "install-intel-extension-for-transformers-and-neural-compressor"]], "Install Intel\u00ae oneCCL Bindings for Pytorch*": [[6, "install-intel-oneccl-bindings-for-pytorch"]], "Install PyTorch and Intel\u00ae Extension for PyTorch*": [[6, "install-pytorch-and-intel-extension-for-pytorch"]], "Install from source": [[6, "install-from-source"]], "Installation of Intel\u00ae oneCCL Bindings for Pytorch*": [[6, "installation-of-intel-oneccl-bindings-for-pytorch"]], "Intel\u00ae AI Reference 
Models": [[4, "intel-ai-reference-models"]], "Intel\u00ae Extension for PyTorch*": [[1, null]], "Intel\u00ae Extension for PyTorch* - DeepSpeed* Kernels": [[14, null]], "Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc": [[0, null]], "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]": [[17, null]], "Introduction": [[6, "introduction"], [7, "introduction"], [8, "introduction"], [9, "introduction"], [11, "introduction"], [13, "introduction"], [14, "introduction"], [18, "introduction"], [20, "introduction"], [21, "introduction"], [22, "introduction"], [23, "introduction"], [25, null], [29, "introduction"], [34, "introduction"], [37, "introduction"]], "JIT Compiling Extensions": [[8, "jit-compiling-extensions"]], "Kernel Stub: csrc/cpu/aten/xyz.cpp and csrc/cpu/aten/xyz.h": [[0, "kernel-stub-csrc-cpu-aten-xyz-cpp-and-csrc-cpu-aten-xyz-h"]], "Kernel implementation: csrc/cpu/aten/kernels/xyzKrnl.cpp": [[0, "kernel-implementation-csrc-cpu-aten-kernels-xyzkrnl-cpp"]], "Kineto Supported Profiler Tool (Prototype)": [[5, "kineto-supported-profiler-tool-prototype"], [20, null]], "Known Issues": [[32, "known-issues"], [32, "id3"], [32, "id6"], [32, "id9"], [32, "id12"], [32, "id15"], [32, "id18"], [32, "id21"], [32, "id23"]], "Known issue": [[12, "known-issue"]], "LLM Inference": [[28, "llm-inference"]], "LLM Performance v2.1.10": [[31, "llm-performance-v2-1-10"]], "LLM fine-tuning on Intel\u00ae Core\u2122 Ultra Processors with Intel\u00ae Arc\u2122 Graphics": [[28, "llm-fine-tuning-on-intel-core-ultra-processors-with-intel-arc-graphics"]], "LLM fine-tuning on Intel\u00ae Data Center Max 1550 GPU": [[28, "llm-fine-tuning-on-intel-data-center-max-1550-gpu"]], "Large Language Models (LLM) Optimizations Overview": [[28, null]], "Legacy Profiler Tool (Deprecated)": [[21, null]], "Library Dependencies": [[26, "library-dependencies"]], "License": [[27, null]], "Linear Operator Optimization": [[28, "linear-operator-optimization"]], "Log Component": [[18, "log-component"]], "Log Level": [[18, "log-level"]], "Low Precision Data Types": [[28, "low-precision-data-types"]], "Memory Format Is All That Matters": [[19, "memory-format-is-all-that-matters"]], "Memory Management": [[36, null]], "Memory Management [GPU]": [[33, "memory-management-gpu"]], "Memory management": [[2, "memory-management"]], "Motivation and Example": [[8, "motivation-and-example"]], "Multiple Implementations Operators and Engines": [[13, "multiple-implementations-operators-and-engines"]], "Op Eligibility": [[11, "op-eligibility"]], "Op-Specific Behavior": [[11, "op-specific-behavior"]], "Operation Fusion": [[37, "operation-fusion"]], "Ops that can autocast to bfloat16": [[11, "ops-that-can-autocast-to-bfloat16"]], "Ops that can autocast to float16": [[11, "ops-that-can-autocast-to-float16"]], "Ops that can autocast to float32": [[11, "ops-that-can-autocast-to-float32"]], "Ops that promote to the widest input type": [[11, "ops-that-promote-to-the-widest-input-type"]], "Optimization Methodologies": [[28, "optimization-methodologies"]], "Optimizer Fusion on GPU": [[37, null]], "Optimizer Optimization [GPU]": [[33, "optimizer-optimization-gpu"]], "Overview": [[0, "overview"], [31, "overview"]], "Performance": [[31, null]], "Performance Data for Intel\u00ae AI Data Center Products": [[31, "performance-data-for-intel-ai-data-center-products"]], "Performance Issue": [[26, "performance-issue"]], "Private Debug APIs": [[0, "private-debug-apis"]], "Profile on Multi-device Application": [[20, 
"profile-on-multi-device-application"]], "Pseudocode of Common Usage Scenarios": [[30, "pseudocode-of-common-usage-scenarios"]], "PyTorch Strided Layout": [[19, "pytorch-strided-layout"]], "Python": [[4, "python"]], "Quantization": [[2, "quantization"], [5, "quantization"]], "Quantize Model and Inference": [[29, "quantize-model-and-inference"]], "Quick Start": [[24, null]], "References": [[29, "references"]], "Releases": [[32, null]], "Replace IPEX_SIMPLE_TRACE": [[18, "replace-ipex-simple-trace"]], "Replace IPEX_VERBOSE": [[18, "replace-ipex-verbose"]], "Requesting the current c10::xpu::XPUStream": [[8, "requesting-the-current-c10-xpu-xpustream"]], "Required Dependencies": [[23, "required-dependencies"]], "Requirement": [[34, "requirement"]], "Resnet50": [[4, "resnet50"], [4, "id2"], [4, "id6"], [4, "id9"], [4, "id12"], [4, "id15"]], "Result": [[20, "result"]], "Results": [[22, "results"]], "Run Weight-Only Quantization LLM on Intel\u00ae GPU": [[29, "run-weight-only-quantization-llm-on-intel-gpu"]], "Runtime Configuration": [[10, "runtime-configuration"]], "Runtime Dynamic Linking": [[6, "runtime-dynamic-linking"]], "Save and Load Quantized Model (Optional)": [[29, "save-and-load-quantized-model-optional"]], "Segment KV Cache": [[28, "segment-kv-cache"]], "Select ISA level manually.": [[0, "select-isa-level-manually"]], "Simple Log": [[18, "simple-log"]], "Simple Trace Tool (Deprecated)": [[22, null]], "Single-Instance Training": [[4, "single-instance-training"]], "SmoothQuant": [[30, "smoothquant"]], "Software Version": [[31, "software-version"]], "Support": [[1, "support"]], "Supported Framework Model Matrix": [[29, "supported-framework-model-matrix"]], "Supported Platform": [[14, "supported-platform"]], "Supported operators": [[15, "supported-operators"]], "Supported running mode": [[15, "supported-running-mode"]], "Technical Details": [[33, null]], "Tips": [[3, "tips"]], "Tips and Debugging": [[3, "tips-and-debugging"]], "TorchScript Mode": [[4, "torchscript-mode"], [4, "id8"], [4, "id14"], [17, "torchscript-mode"], [30, "torchscript-mode"]], "Training": [[4, "training"]], "Training Support": [[11, "training-support"]], "Training with torch.compile": [[23, "training-with-torch-compile"]], "Transformers Optimization Frontend API": [[30, null]], "Troubleshooting": [[26, null]], "Unit Test": [[26, "unit-test"]], "Unit testing": [[3, "unit-testing"]], "Usage in C++": [[18, "usage-in-c"]], "Usage in python": [[18, "usage-in-python"]], "Usage of DDP scaling API": [[6, "usage-of-ddp-scaling-api"]], "Usage of running Weight-Only Quantization LLM For Intel\u00ae GPU": [[29, "usage-of-running-weight-only-quantization-llm-for-intel-gpu"]], "Use Case": [[7, "use-case"], [11, "use-case"], [13, "use-case"], [20, "use-case"], [22, "use-case"]], "Use SYCL code": [[4, "use-sycl-code"]], "Use Simple Trace in Model": [[22, "use-simple-trace-in-model"]], "Use Tool": [[20, "use-tool"]], "Use case": [[34, "use-case"]], "Using accessors": [[8, "using-accessors"]], "Validated Models List": [[28, "validated-models-list"]], "Vec specific kernel example:": [[0, "vec-specific-kernel-example"]], "Weight Only Quantization INT4": [[28, "weight-only-quantization-int4"]], "Weight-Only Quantization (Prototype)": [[29, null]], "Weight-Only Quantization Initialization": [[29, "weight-only-quantization-initialization"]], "Weight-Only Quantization LLM features in Intel\u00ae Extension for PyTorch*": [[29, "weight-only-quantization-llm-features-in-intel-extension-for-pytorch"]], "Weight-Only Quantization Linear Dispatch": 
[[29, "weight-only-quantization-linear-dispatch"]], "Weight-Only Quantization Runtime": [[29, "weight-only-quantization-runtime"]], "What is Channels Last": [[19, "what-is-channels-last"]], "Writing Channels Last Kernels on CPU": [[19, "writing-channels-last-kernels-on-cpu"]], "Writing a DPC++ Extension": [[8, "writing-a-dpc-extension"]], "Writing documentation": [[3, "writing-documentation"]], "Writing the DPC++ Op": [[8, "writing-the-dpc-op"]], "[Recommended] Install from prebuilt wheels": [[6, "recommended-install-from-prebuilt-wheels"]], "a. Create NHWC Memory": [[19, "a-create-nhwc-memory"]], "a. NCHW (default)": [[19, "a-nchw-default"]], "a. Register Channels Last Kernel in ATen Native Manner": [[19, "a-register-channels-last-kernel-in-aten-native-manner"]], "a. tensor conversion with Channels Last 1D": [[19, "a-tensor-conversion-with-channels-last-1d"]], "a. tensor creation": [[19, "a-tensor-creation"]], "b. Create Convolution Primitive": [[19, "b-create-convolution-primitive"]], "b. NHWC": [[19, "b-nhwc"]], "b. Register oneDNN Kernel on Channels Last": [[19, "b-register-onednn-kernel-on-channels-last"]], "b. model conversion with Channels Last 1D": [[19, "b-model-conversion-with-channels-last-1d"]], "b. tensor conversion": [[19, "b-tensor-conversion"]], "c. Blocked (nChw16c, on CPU)": [[19, "c-blocked-nchw16c-on-cpu"]], "c. determine if in Channels Last 1D memory format": [[19, "c-determine-if-in-channels-last-1d-memory-format"]], "c. model conversion": [[19, "c-model-conversion"]], "conv_bn_folding": [[35, "conv-bn-folding"]], "d. operator coverage in PyTorch": [[19, "d-operator-coverage-in-pytorch"]], "default": [[12, "default"]], "disable": [[12, "disable"]], "enable": [[12, "enable"]], "fuse_update_step": [[35, "fuse-update-step"]], "ipex.optimize Frontend API": [[35, null]], "ipex.optimize [GPU]": [[33, "ipex-optimize-gpu"]], "linear_bn_folding": [[35, "linear-bn-folding"]], "oneDNN NHWC APIs": [[19, "onednn-nhwc-apis"]], "replace_dropout_with_identity": [[35, "replace-dropout-with-identity"]], "split_master_weight_for_bf16": [[35, "split-master-weight-for-bf16"]], "torch.compile for GPU (Beta)": [[5, "torch-compile-for-gpu-beta"], [23, null]], "torch.xpu.optimize": [[4, "torch-xpu-optimize"]]}, "docnames": ["design_doc/cpu/isa_dyndisp", "index", "tutorials/api_doc", "tutorials/contribution", "tutorials/examples", "tutorials/features", "tutorials/features/DDP", "tutorials/features/DLPack", "tutorials/features/DPC++_Extension", "tutorials/features/FSDP", "tutorials/features/advanced_configuration", "tutorials/features/amp_gpu", "tutorials/features/auto_channels_last", "tutorials/features/compute_engine", "tutorials/features/deepspeed_kernels", "tutorials/features/float8", "tutorials/features/horovod", "tutorials/features/int8_overview_xpu", "tutorials/features/ipex_log", "tutorials/features/nhwc", "tutorials/features/profiler_kineto", "tutorials/features/profiler_legacy", "tutorials/features/simple_trace", "tutorials/features/torch_compile_gpu", "tutorials/getting_started", "tutorials/introduction", "tutorials/known_issues", "tutorials/license", "tutorials/llm", "tutorials/llm/int4_weight_only_quantization", "tutorials/llm/llm_optimize_transformers", "tutorials/performance", "tutorials/releases", "tutorials/technical_details", "tutorials/technical_details/AOT", "tutorials/technical_details/ipex_optimize", "tutorials/technical_details/memory_management", "tutorials/technical_details/optimizer_fusion_gpu"], "envversion": {"sphinx": 62, "sphinx.domains.c": 3, 
"sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2}, "filenames": ["design_doc/cpu/isa_dyndisp.md", "index.rst", "tutorials/api_doc.rst", "tutorials/contribution.md", "tutorials/examples.md", "tutorials/features.rst", "tutorials/features/DDP.md", "tutorials/features/DLPack.md", "tutorials/features/DPC++_Extension.md", "tutorials/features/FSDP.md", "tutorials/features/advanced_configuration.md", "tutorials/features/amp_gpu.md", "tutorials/features/auto_channels_last.md", "tutorials/features/compute_engine.md", "tutorials/features/deepspeed_kernels.md", "tutorials/features/float8.md", "tutorials/features/horovod.md", "tutorials/features/int8_overview_xpu.md", "tutorials/features/ipex_log.md", "tutorials/features/nhwc.md", "tutorials/features/profiler_kineto.md", "tutorials/features/profiler_legacy.md", "tutorials/features/simple_trace.md", "tutorials/features/torch_compile_gpu.md", "tutorials/getting_started.md", "tutorials/introduction.rst", "tutorials/known_issues.md", "tutorials/license.md", "tutorials/llm.rst", "tutorials/llm/int4_weight_only_quantization.md", "tutorials/llm/llm_optimize_transformers.md", "tutorials/performance.md", "tutorials/releases.md", "tutorials/technical_details.rst", "tutorials/technical_details/AOT.md", "tutorials/technical_details/ipex_optimize.md", "tutorials/technical_details/memory_management.rst", "tutorials/technical_details/optimizer_fusion_gpu.md"], "indexentries": {"empty_cache() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.empty_cache", false]], "fp8_autocast() (in module intel_extension_for_pytorch.quantization.fp8)": [[2, "intel_extension_for_pytorch.quantization.fp8.fp8_autocast", false]], "get_fp32_math_mode() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.get_fp32_math_mode", false]], "max_memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.max_memory_allocated", false]], "max_memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.max_memory_reserved", false]], "memory_allocated() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_allocated", false]], "memory_reserved() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_reserved", false]], "memory_snapshot() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_snapshot", false]], "memory_stats() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_stats", false]], "memory_stats_as_nested_dict() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_stats_as_nested_dict", false]], "memory_summary() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.memory_summary", false]], "optimize() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.optimize", false]], "optimize() (in module intel_extension_for_pytorch.llm)": [[2, "intel_extension_for_pytorch.llm.optimize", false]], "reset_accumulated_memory_stats() (in module intel_extension_for_pytorch.xpu)": [[2, "intel_extension_for_pytorch.xpu.reset_accumulated_memory_stats", false]], "reset_peak_memory_stats() (in module intel_extension_for_pytorch.xpu)": 
[[2, "intel_extension_for_pytorch.xpu.reset_peak_memory_stats", false]], "set_fp32_math_mode() (in module intel_extension_for_pytorch)": [[2, "intel_extension_for_pytorch.set_fp32_math_mode", false]], "torch_ipex::xpu::fp32_math_mode (c++ enum)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODEE", false]], "torch_ipex::xpu::fp32_math_mode::bf32 (c++ enumerator)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4BF32E", false]], "torch_ipex::xpu::fp32_math_mode::fp32 (c++ enumerator)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4FP32E", false]], "torch_ipex::xpu::fp32_math_mode::fp32_math_mode_max (c++ enumerator)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", false]], "torch_ipex::xpu::fp32_math_mode::fp32_math_mode_min (c++ enumerator)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MINE", false]], "torch_ipex::xpu::fp32_math_mode::tf32 (c++ enumerator)": [[2, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4TF32E", false]], "torch_ipex::xpu::set_fp32_math_mode (c++ function)": [[2, "_CPPv4N10torch_ipex3xpu18set_fp32_math_modeE14FP32_MATH_MODE", false]]}, "objects": {"": [[2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4BF32E", "torch_ipex::xpu::BF32"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4FP32E", "torch_ipex::xpu::FP32"], [2, 1, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODEE", "torch_ipex::xpu::FP32_MATH_MODE"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4BF32E", "torch_ipex::xpu::FP32_MATH_MODE::BF32"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4FP32E", "torch_ipex::xpu::FP32_MATH_MODE::FP32"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "torch_ipex::xpu::FP32_MATH_MODE::FP32_MATH_MODE_MAX"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MINE", "torch_ipex::xpu::FP32_MATH_MODE::FP32_MATH_MODE_MIN"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4TF32E", "torch_ipex::xpu::FP32_MATH_MODE::TF32"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MAXE", "torch_ipex::xpu::FP32_MATH_MODE_MAX"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE18FP32_MATH_MODE_MINE", "torch_ipex::xpu::FP32_MATH_MODE_MIN"], [2, 0, 1, "_CPPv4N10torch_ipex3xpu14FP32_MATH_MODE4TF32E", "torch_ipex::xpu::TF32"], [2, 2, 1, "_CPPv4N10torch_ipex3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "torch_ipex::xpu::set_fp32_math_mode"], [2, 3, 1, "_CPPv4N10torch_ipex3xpu18set_fp32_math_modeE14FP32_MATH_MODE", "torch_ipex::xpu::set_fp32_math_mode::mode"]], "intel_extension_for_pytorch": [[2, 4, 1, "", "get_fp32_math_mode"], [2, 4, 1, "", "optimize"], [2, 4, 1, "", "set_fp32_math_mode"]], "intel_extension_for_pytorch.llm": [[2, 4, 1, "", "optimize"]], "intel_extension_for_pytorch.quantization.fp8": [[2, 4, 1, "", "fp8_autocast"]], "intel_extension_for_pytorch.xpu": [[2, 4, 1, "", "empty_cache"], [2, 4, 1, "", "max_memory_allocated"], [2, 4, 1, "", "max_memory_reserved"], [2, 4, 1, "", "memory_allocated"], [2, 4, 1, "", "memory_reserved"], [2, 4, 1, "", "memory_snapshot"], [2, 4, 1, "", "memory_stats"], [2, 4, 1, "", "memory_stats_as_nested_dict"], [2, 4, 1, "", "memory_summary"], [2, 4, 1, "", "reset_accumulated_memory_stats"], [2, 4, 1, "", "reset_peak_memory_stats"]]}, "objnames": {"0": ["cpp", "enumerator", "C++ enumerator"], "1": ["cpp", "enum", "C++ enum"], "2": ["cpp", "function", "C++ function"], "3": ["cpp", "functionParam", "C++ function parameter"], "4": ["py", "function", "Python function"]}, "objtypes": {"0": "cpp:enumerator", "1": "cpp:enum", "2": "cpp:function", "3": "cpp:functionParam", 
"4": "py:function"}, "terms": {"": [2, 6, 8, 9, 11, 18, 19, 20, 26, 29, 37], "0": [0, 1, 3, 4, 6, 8, 9, 10, 11, 16, 18, 20, 22, 23, 26, 27, 29, 31, 37], "00000000000602e7": 0, "001": [4, 6, 11], "00978": 29, "04": [26, 31, 32], "05516": 29, "09": 0, "09557": 29, "0f": 8, "0x00007fd552f36000": 4, "0x00007fd55321b000": 4, "0x00007fd553600000": 4, "0x00007fd553a9c000": 4, "0x00007fd553b11000": 4, "0x00007fd55511d000": 4, "0x00007fd55512c000": 4, "0x00007fd57eb1d000": 4, "0x00007fd5806cc000": 4, "0x00007fd584ab0000": 4, "0x00007fd5862b0000": 4, "0x00007fd5a1a1b000": 4, "0x00007fd5a44d8000": 4, "0x00007fd5bb895000": 4, "0x00007fd5bb927000": 4, "0x1": 0, "0x2b0004b1": 31, "0x7fff": 0, "0xffff": 0, "1": [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 18, 19, 20, 22, 23, 26, 28, 29, 35, 37], "10": [0, 3, 7, 9, 18, 19, 20, 22], "100": [0, 9, 16, 20, 22, 31, 32], "1000": 9, "10004": 29, "1020": 31, "1024": [8, 20, 31], "1024gb": 31, "10944": 29, "11": [0, 22], "12": [0, 22, 32], "123": 6, "12355": 9, "127": 6, "128": [4, 9, 11], "128k": 28, "13": [0, 22, 26], "1307": 9, "13b": [28, 31, 32], "14": [9, 22], "15": [0, 22], "1550": [31, 32], "16": [0, 22, 31], "17": [4, 8, 22, 31], "170": [26, 32, 34], "17323": 29, "18": [8, 22, 32], "19": 22, "1b2f15840e": 23, "1d": 10, "1mb": 2, "2": [0, 1, 4, 6, 7, 8, 9, 11, 18, 19, 20, 22, 23, 26, 27, 28, 29, 31, 33, 34, 35], "20": [13, 19, 22, 26], "200": 29, "2019": 2, "2020": 7, "2021": [0, 26, 32], "2022": [29, 32], "2023": [29, 31, 32], "202306": 29, "2024": [23, 31, 32], "202x": 4, "2048": 8, "21": 32, "22": [26, 31], "2206": 29, "2210": 29, "224": [4, 11, 23], "23": 26, "2304190630": 31, "2306": 29, "2309": 29, "2310": 29, "24": 26, "25": [9, 31], "262618": 22, "29500": 6, "2d": [19, 35], "2f": 9, "2gb": 32, "3": [0, 3, 4, 6, 8, 9, 11, 13, 18, 19, 20, 22, 23, 28, 29, 31, 35], "3081": 9, "30b": 28, "31": 31, "32": [9, 19], "33": 0, "33_offlin": 20, "3696": 32, "3706": 32, "3788": 32, "3796": 32, "3808": 32, "3829": 32, "3841": 32, "3882": 32, "3887": 32, "3970": 32, "3_25mhzi_quad_dameni_oam600w_ifrv2332i_pscnull_ifwi": 31, "3b": 28, "3d": 19, "4": [6, 8, 18, 19, 22, 29, 32], "4280": 32, "4317": 32, "4354": 32, "4358": 32, "4361": 32, "4407": 32, "4429": 32, "4439": 32, "4450": 32, "4463": 32, "4468": 32, "4480": 32, "4495": 32, "4504": 32, "4527": 32, "4557": 32, "4558": 32, "4800": 31, "4bit": 29, "4dee": 20, "4f": 9, "4fc181b0": 31, "4k": 28, "4x": 31, "5": [0, 4, 6, 9, 13, 17, 18, 19, 22, 29, 30, 31, 32], "50": 19, "500mb": [33, 34], "512": [1, 4, 32], "56": 31, "58": 0, "5987ec30": 20, "5gb": [33, 34], "5x": 32, "6": [3, 4, 22, 31], "64": [9, 11, 23, 29], "64gb": 31, "66": 0, "6b": [28, 29, 31, 32], "6f": 9, "7": [0, 7, 9, 22], "70b": [28, 32], "736": 31, "7b": [28, 29, 31, 32], "7b1": 28, "7b97a1113488": 20, "8": [8, 15, 22], "80": 3, "8480": 31, "86b": 31, "870f": 20, "8b": [28, 32], "9": [0, 4, 20, 22, 32], "9216": 9, "9525": 31, "997": 26, "A": [0, 3, 4, 5, 17, 19, 26, 29, 32, 34], "And": [8, 32], "As": [8, 17, 26, 29, 37], "At": [0, 8, 28], "Be": 17, "But": [0, 19], "By": [0, 2, 5, 10, 35], "FOR": 15, "For": [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 18, 19, 20, 24, 25, 26, 28, 32, 33, 34, 35, 36, 37], "If": [0, 2, 3, 4, 6, 7, 8, 11, 12, 13, 17, 18, 19, 26, 29, 32, 33, 34, 35], "In": [0, 1, 2, 4, 5, 8, 11, 13, 19, 20, 22, 28, 29, 32], "It": [0, 2, 5, 6, 7, 8, 9, 14, 18, 19, 22, 24, 26, 29, 30, 32, 33, 34], "Its": [33, 35], "NOT": [14, 19], "No": [2, 19, 22, 26, 32], "ON": [10, 20, 22, 31], "On": [1, 5, 15, 19, 28, 29], "One": [7, 15, 19, 37], "Or": 6, 
"Such": [0, 20, 32], "The": [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37], "Then": [3, 6, 17, 26, 29, 32], "There": [18, 20, 24, 28], "These": [1, 4, 11, 15, 26, 28, 29, 32], "To": [0, 2, 3, 4, 5, 6, 9, 16, 19, 20, 22, 24, 28, 29, 32], "Will": 19, "With": [1, 2, 5, 6, 8, 16, 17, 20, 22, 32], "_": [0, 4, 6, 7, 8, 14, 17, 18, 19, 22, 26, 29, 32, 35], "__init__": [3, 6, 9, 11, 19], "__m256i": 0, "__m512": 0, "__m512i": 0, "__main__": [6, 9], "__name__": [6, 9], "__release_lnx": 32, "_appli": 19, "_build": 3, "_c": [0, 2], "_cmp_ord_q": 0, "_convolut": 11, "_cvt_fp32_to_bf16": 0, "_distributed_c10d": 2, "_get_current_isa_level": 0, "_get_highest_binary_support_isa_level": 0, "_get_highest_cpu_support_isa_level": 0, "_glibcxx_use_cxx11_abi": 26, "_local_scalar_dens": 22, "_mm256_mask_storeu_epi16": 0, "_mm256_storeu_si256": 0, "_mm512_add_epi32": 0, "_mm512_and_si512": 0, "_mm512_castps_si512": 0, "_mm512_cmp_ps_mask": 0, "_mm512_cvtneps_pbh": 0, "_mm512_cvtusepi32_epi16": 0, "_mm512_loadu_p": 0, "_mm512_mask_blend_epi32": 0, "_mm512_maskz_loadu_p": 0, "_mm512_set1_epi32": 0, "_mm512_srli_epi32": 0, "_optimizer_util": 35, "_original_step": 35, "_parameter_wrapp": 35, "_recurs": 4, "_reshape_alia": 22, "_thnn_fused_gru_cel": 11, "_unique2": 22, "_xpu": 8, "_znk5torch8autograd4node4nameb5cxx11ev": [26, 32], "a770": 32, "a_0": 20, "a_1": 20, "ab": [18, 32], "abbrevi": 2, "abi": [0, 4, 32], "about": [1, 2, 3, 6, 8, 9], "abov": [3, 6, 7, 9, 10, 18, 19, 20, 22, 28, 37], "absenc": 26, "absolut": 4, "abstract": 8, "acceler": [1, 2, 5, 15, 23, 25, 29, 32], "accept": [10, 18], "access": [7, 8, 19, 28, 32, 37], "accommod": 19, "accomplish": 16, "accord": [2, 28, 29], "accumul": 2, "accur": [11, 28, 29], "accuraci": [2, 9, 11, 28, 29, 32], "achiev": [1, 2, 4, 32], "across": [2, 5, 6, 9], "action": [9, 18], "activ": [2, 4, 15, 16, 17, 20, 23, 26, 28, 29, 30], "active_byt": 2, "actual": [7, 8, 19, 26, 32], "ad": [2, 4, 20, 23, 32], "adadelta": 9, "adam": 37, "adamw": [32, 37], "adaptiveaveragepoolingkrnl": 0, "add": [10, 11, 14, 16, 18, 19, 22, 26, 32, 37], "add_": 37, "add_argu": 9, "add_execut": 4, "add_librari": 8, "addbmm": 11, "addcdiv": 11, "addcmul": 11, "addit": [0, 2, 4, 29, 32, 33, 34, 35], "addition": [23, 29], "addmm": [8, 11, 32], "addmm_": 11, "addmv": 11, "addr": 11, "address": [6, 10, 19, 26], "addtion": 0, "adjust": [6, 29], "adopt": [7, 28, 29, 32], "advanc": [1, 8, 24, 29, 32, 36], "advantag": [1, 2, 12, 19, 25, 32], "aes_ni": 0, "after": [2, 3, 4, 8, 17, 20, 22, 24, 26, 29, 37], "afterward": 8, "again": [3, 37], "against": 4, "agnost": 8, "agre": 3, "ahead": [3, 8, 20, 22], "ai": [1, 10, 25, 26, 28, 32], "aka": 19, "akdlm": 20, "al": 29, "algebra": [13, 28], "algorithm": [10, 15, 19, 29, 32], "alia": [2, 4, 8], "align": [0, 9, 18, 20, 32], "align_corn": 13, "all": [0, 2, 3, 4, 6, 8, 9, 10, 11, 13, 16, 18, 20, 22, 26, 28, 29, 30, 31, 32, 35, 36, 37], "all_reduc": 9, "allgath": [6, 9, 16, 32], "alloc": [2, 7, 16, 18, 28, 33, 36], "allocated_byt": 2, "allow": [2, 4, 10, 11, 26, 29, 32, 33, 34], "allreduc": [2, 6, 16, 32], "alltoal": [6, 16], "almost": 19, "along": [2, 3, 22], "alpha": [8, 37], "alreadi": [1, 3, 4, 16, 19, 28, 33], "also": [1, 2, 4, 5, 7, 8, 10, 17, 18, 19, 20, 26, 28, 29, 30, 32, 33, 34, 36, 37], "altern": [2, 4, 5, 6, 19], "although": 2, "alwai": [3, 11, 13, 19, 26, 32], "amax": 2, "amc": 31, "among": [5, 7, 16], "amount": [2, 36], "amp": [4, 23, 24, 32], "amp_dtyp": 30, "amplifi": 1, "amx": [0, 
1, 32], "amx_bf16": 0, "amx_int8": 0, "amx_til": 0, "an": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 16, 17, 18, 19, 20, 26, 28, 29, 32, 37], "anaconda": 0, "analysi": 17, "ani": [0, 2, 3, 11, 18, 19, 26, 32, 33], "annot": 10, "anonym": 0, "anoth": 29, "answer": 19, "anymor": 5, "aot": [8, 10, 32], "apach": [16, 27], "api": [1, 4, 8, 9, 10, 17, 20, 23, 28, 29, 32, 33], "app": 4, "appear": [26, 32], "append": 4, "appli": [2, 4, 5, 11, 16, 19, 24, 28, 29, 30, 32, 37], "applic": [1, 2, 4, 28, 29, 33, 34, 35, 36], "approach": [8, 26, 28], "appropri": 33, "ar": [0, 1, 2, 3, 4, 5, 7, 8, 10, 11, 14, 15, 16, 17, 18, 19, 20, 22, 24, 26, 28, 29, 30, 32, 33, 34, 35, 37], "arc": [26, 29, 32, 34], "architectur": [10, 28, 29], "arg": [6, 9, 16, 18, 20, 37], "argc": 4, "argmax": 9, "argpars": 9, "argument": [6, 9, 13, 20, 29], "argumentpars": 9, "argv": 4, "around": 8, "arrai": [19, 35], "arrow": 22, "arxiv": 29, "as_strid": 22, "ask": 3, "assert": [10, 20], "assign": [19, 20], "associ": [8, 33], "assum": [11, 24], "asymmetr": [17, 32], "asynchron": 10, "at_dispatch_all_typ": 8, "at_dispatch_floating_typ": 8, "atan2": 11, "aten": [2, 4, 7, 8, 10], "aten_cpu_cap": 0, "atendlmtensor": 7, "atenipextypexpu": [22, 29], "ats": 34, "attempt": [18, 33], "attent": [1, 28, 29, 32], "attention_mask": 15, "attn": 32, "attribut": [2, 19], "auto": [0, 2, 4, 8, 19, 28, 32], "auto_kernel_select": 2, "autocast": [4, 23, 24], "autoclass": 3, "autofunct": 3, "autom": 11, "automat": [1, 2, 4, 5, 12, 17, 19, 22, 26, 28, 29, 32, 33, 34], "automaticlli": 2, "automodelforcausallm": [29, 30, 32], "autoround": 29, "autotoken": 29, "avail": [0, 1, 2, 4, 5, 6, 8, 10, 13, 24, 29, 30, 32, 36], "aval": 8, "averag": [9, 16, 20], "averagepool2d": 13, "avg": 20, "avg_pool": 18, "avoid": [2, 3, 8, 26], "avx": [0, 1, 32], "avx2": 0, "avx256": 0, "avx2_vnni": 0, "avx512": [0, 19, 32], "avx512_4fmap": 0, "avx512_4vnniw": 0, "avx512_bf16": 0, "avx512_bitalg": 0, "avx512_bw": 0, "avx512_cd": 0, "avx512_dq": 0, "avx512_er": 0, "avx512_f": 0, "avx512_fp16": 0, "avx512_ifma": 0, "avx512_pf": 0, "avx512_vbmi": 0, "avx512_vbmi2": 0, "avx512_vl": 0, "avx512_vnni": 0, "avx512_vp2intersect": 0, "avx512_vpclmul": 0, "avx512_vpopcntdq": 0, "avx_vnni": 0, "awar": [19, 29], "awq": 29, "b": [3, 11, 22], "b4": 31, "b_0": 20, "b_1": 20, "back": [0, 13, 19, 26], "backend": [0, 1, 2, 5, 6, 7, 8, 9, 13, 16, 20, 23, 26, 28, 29, 32, 34], "background": 8, "backpropag": 29, "backward": [4, 6, 8, 9, 11, 16, 23, 32, 35], "backwardprefetch": 9, "bad": 26, "baddbmm": 11, "baeseong": 29, "bag": 32, "baichuan": 28, "baichuan2": [28, 32], "bandwidth": [28, 29], "barrier": 9, "base": [0, 1, 2, 3, 4, 8, 9, 10, 13, 16, 20, 23, 28, 29, 30, 31, 32], "basekit": [6, 16, 23], "bash": [26, 29], "basic": [13, 32], "batch": [4, 8, 9, 16, 19, 32, 33, 35], "batch_idx": [4, 9, 16], "batch_siz": [4, 6, 8, 9, 16, 19], "batchnorm": [0, 35], "batchnorm1d": 19, "be32": 20, "beam": [28, 32], "becaus": [0, 4, 11, 19, 28, 34], "becom": 28, "been": [0, 1, 4, 5, 8, 19, 23, 26, 32], "befor": [0, 1, 2, 3, 4, 5, 10, 17, 19, 20, 22, 26, 29, 33, 34, 35], "beforehand": [29, 33, 34], "begin": [2, 3, 8, 22], "behavior": [2, 10, 13, 22], "being": [14, 22], "believ": [11, 19], "belong": 18, "below": [4, 6, 8, 11, 13, 14, 18, 19, 20, 22, 24, 26, 29, 32, 34, 37], "benchmark": [4, 31, 36], "benefici": 19, "benefit": [4, 5, 11, 17], "benifit": [33, 34], "bert": 15, "bertmodel": 4, "besid": [8, 28, 29, 32], "best": [0, 2, 11, 17, 28], "beta": 32, "better": [1, 2, 10, 13, 17, 19, 28, 29, 32, 35, 37], "between": [0, 7, 
11, 22, 28, 29], "bf16": [0, 2, 24, 28, 32, 37], "bf32": [2, 10], "bfloat16": [0, 2, 5, 13, 23, 24, 32, 35], "bia": [2, 8, 11, 14, 19, 29, 32], "big": 19, "bigscienc": 28, "bilinear": 11, "bin": [0, 4, 26, 31, 32], "binari": [0, 3, 4, 11, 19, 33, 34], "binary_cross_entropi": 11, "binary_cross_entropy_with_logit": 11, "binaryop": 32, "bind": [5, 8, 9, 32], "bio": 31, "bit": 15, "bla": 10, "block": [2, 3, 10, 29, 32, 33, 35], "blocksiz": 29, "bloom": [28, 31], "bmm": 11, "bmp": 19, "bn": 2, "boast": 29, "bodi": 0, "bool": 2, "boost": [4, 12, 31, 32], "both": [1, 2, 4, 5, 7, 15, 17, 19, 28, 29, 30, 32, 34, 35, 37], "bottl": 37, "bottleneck": [28, 29], "bottom": 35, "bound": [28, 32, 37], "box": 4, "bracket": 22, "brain": 29, "branch": 1, "bridg": 8, "brief": 28, "bring": [12, 17, 28, 32, 33], "broad": 12, "broadcast": 16, "broadcast_optimizer_st": 16, "broadcast_paramet": 16, "broader": [29, 32], "broken": 2, "buf": 37, "buffer": 28, "bug": [1, 3, 26, 32, 33, 34], "build": [4, 5, 16, 22, 26, 32, 34], "build_by_per_kernel": 10, "build_ext": 8, "build_internal_debug": 10, "build_opt_level": 10, "build_separate_op": 10, "build_simple_trac": 10, "build_with_sanit": 10, "built": [0, 8, 22, 26, 32, 34], "bwd": 32, "byeongwook": 29, "byte": 2, "c": [0, 1, 5, 6, 8, 11, 22], "c10": [0, 4], "c10d": [6, 9], "cach": [2, 3, 8, 10, 14, 18, 32, 33, 36, 37], "cache_en": 24, "cai": 29, "calcul": [1, 2, 8, 11, 18, 20, 29, 32], "calib_dataload": 29, "calib_dataset": [17, 30], "calib_func": 29, "calibr": [2, 4, 17, 30], "calibration_data_load": 4, "call": [0, 2, 4, 6, 8, 11, 18, 19, 20, 22, 29, 35, 36], "can": [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 13, 16, 17, 18, 19, 20, 22, 24, 26, 28, 29, 30, 32, 33, 34, 35, 36, 37], "can_cast_train": 35, "candidate_cel": 8, "cannot": [11, 19, 26, 32], "canon": 19, "capabl": [0, 5, 17, 18, 23, 29, 32, 36], "capac": [5, 13, 31], "capsule2": 7, "captur": 36, "card": [6, 10, 19, 28, 32], "care": [8, 22], "case": [0, 2, 4, 5, 8, 9, 10, 12, 19, 26, 29, 32], "cast": [2, 8, 11], "cat": [8, 11, 13], "catch": [4, 17, 18], "categori": [11, 14], "caus": [2, 26, 28, 32, 34], "cc": [0, 4], "ccl": [6, 9, 10, 16, 26, 31], "ccl_blocking_wait": 10, "ccl_root": [24, 26], "ccl_same_stream": 10, "cd": [3, 4], "cdist": 11, "center": [5, 14, 26, 29, 32, 34], "cerr": 4, "certain": [1, 2, 26, 29, 30], "cgf": 8, "cgh": 8, "chain_matmul": 11, "challeng": [28, 29, 32], "chang": [0, 2, 3, 4, 6, 8, 9, 11, 16, 19, 24, 26, 29, 30, 32], "channel": [2, 10, 17, 32, 33], "channels_last": [4, 5, 19], "char": 4, "charact": 3, "chat": 28, "chatglm3": [28, 32], "check": [4, 5, 6, 8, 10, 13, 19, 20, 24, 28, 30, 32, 33, 35], "check_contigu": 8, "check_input": 8, "check_xpu": 8, "checkpoint": [2, 4, 16, 26], "cheng": 29, "child": 22, "children": 20, "chines": 32, "choos": [4, 5, 11, 13, 32], "chosen": [0, 11, 13], "chw": 19, "chwn": 19, "cifar10": 4, "circumst": 11, "cl": 4, "cl_device_not_found": 26, "class": [6, 9, 11, 19], "classif": 9, "claus": 37, "clean": [3, 26, 32], "cleanup": 9, "client": 32, "clone": [3, 22, 37], "close": 19, "cmake": [0, 4, 5, 32], "cmake_cxx_flag": 4, "cmake_have_libc_pthread": 4, "cmake_minimum_requir": [4, 8], "cmake_prefix_path": 8, "cmakefil": 0, "cmakelist": [4, 8], "cmdclass": 8, "cnn": 19, "co": 32, "code": [1, 2, 3, 5, 8, 9, 10, 13, 16, 19, 20, 22, 24, 26, 27, 30, 32, 33, 34, 36, 37], "codegen": 23, "codellama": 28, "collect": [2, 4, 5, 6, 9, 10, 17, 32], "column": [4, 8, 20], "com": [3, 6, 9, 20, 26], "combin": [2, 17], "come": 8, "comma": [10, 34], "command": [3, 4, 6, 8, 16, 20, 23, 
26, 32], "comment": [0, 3], "commit": [3, 31], "common": [0, 14], "commun": [5, 6, 7, 9, 10, 32], "compar": [1, 2, 5, 19, 29, 32, 37], "compat": [0, 29, 32], "compens": 16, "competit": 32, "compil": [1, 3, 4, 10, 26, 32], "compile_flag": 8, "compiled_model": 23, "complet": [3, 4, 18, 19, 30, 36], "complex": [0, 28, 29], "complexdoubl": 0, "complexfloat": 0, "complic": [8, 22], "complier": 0, "compon": [8, 10, 27, 28], "compos": [4, 9, 15], "comprehens": [1, 32, 36], "compress": [2, 15, 29], "compression_dim": 29, "compression_dtyp": 29, "compris": 19, "comput": [2, 4, 10, 15, 16, 19, 23, 28, 29, 32, 33, 34, 35], "compute_dtyp": 29, "compute_eng": 13, "concat": [13, 19, 28], "concat_linear": 2, "concaten": [13, 28], "concept": [16, 19], "conclus": 19, "concret": 32, "conda": [26, 32], "conda_prefix": [24, 26], "condit": [27, 32, 35], "conf": 29, "config": [2, 4, 17], "configur": [0, 2, 4, 8, 17, 18, 24, 26, 29, 32, 34], "conflict": [0, 26], "connect": 35, "conserv": 26, "consid": 8, "consider": 17, "consist": [16, 28, 32], "consol": [18, 20, 22], "const": [0, 4, 8], "constrain": 29, "consum": [7, 20], "consumpt": 20, "contain": [0, 3, 6, 7, 10, 13, 29], "content": [29, 30, 32], "context": [2, 3, 7, 8, 11, 22, 28], "contextlib": 20, "contigu": [8, 19, 28, 32], "contiguous_format": 19, "continu": [5, 13, 18, 22, 26, 32], "contribut": 23, "control": [1, 10, 20, 22], "conv": [2, 11, 18, 32, 35], "conv1": 9, "conv1d": [11, 19], "conv2": 9, "conv2d": [2, 9, 11, 19, 32], "conv3d": [11, 32], "conv_binari": 17, "conv_bn": 2, "conv_bn_fold": 2, "conv_relu": 17, "conv_sum_relu": 17, "conv_tbc": 11, "conv_transpose1d": 11, "conv_transpose3d": 11, "conv_unari": 17, "conveni": [8, 11], "convent": 9, "converg": 26, "convers": [2, 4, 11, 17, 29, 32], "convert": [0, 1, 2, 4, 5, 7, 11, 12, 15, 17, 19, 29, 30, 32, 35], "convert_dtype_str2torch": 29, "convert_jit": [4, 17, 30], "convolut": [2, 5, 11, 23, 33, 35], "convolutuon": 2, "convtranspos": 35, "convtranspose2d": 2, "coo": 19, "copi": [0, 3, 5, 7, 19], "copy_": 22, "copyright": [0, 27], "core": [0, 2, 16, 23, 26, 31, 32], "correct": [8, 9, 19, 26], "correspond": [2, 6, 26, 29, 32], "correspondingli": 29, "corrspond": 19, "corrupt": 16, "cosine_embedding_loss": 11, "cosine_similar": 11, "cost": [2, 5, 8, 20, 21], "could": [6, 10, 17, 18, 19, 23, 26, 29, 30, 35], "counter": 2, "counterpart": [2, 32], "cout": 4, "cover": [19, 28], "cpp": [3, 4, 8], "cpp_extens": 8, "cppsdk": 4, "cpu": [1, 2, 4, 5, 6, 20, 24, 26, 31, 32, 35], "cpu_capability_avx512": 0, "cpu_capability_avx512_bf16": 0, "cpu_featur": 0, "cpu_feature_main": 0, "cpuid": 0, "cpuinfo": 0, "cpuoffload": 9, "creat": [2, 3, 4, 7, 8, 14, 17, 23, 26, 30, 32, 35], "credit": 0, "criterion": [4, 11], "critic": [18, 32], "cross": [10, 11, 32], "cross_entropy_loss": 11, "crossentropyloss": 4, "crucial": 26, "cuda": [8, 9, 32], "cudnn": 19, "curdevid": 29, "current": [0, 1, 2, 3, 5, 7, 9, 13, 17, 18, 22, 26, 28, 29, 30, 34, 35, 37], "current_devic": 2, "current_stream": 8, "custom": [1, 2, 5, 8, 10, 13, 14, 20, 28, 32], "cvt_fp32_to_bf16": 0, "cvt_fp32_to_bf16_kernel_fn": 0, "cvt_fp32_to_bf16_kernel_impl": 0, "cvt_fp32_to_bf16_kernel_stub": 0, "cvtfp32tobf16": 0, "cvtfp32tobf16krnl": 0, "cwd": 6, "cxx": [0, 4], "cxx11": 32, "cxx_standard": [4, 8], "d": [3, 4, 8, 11, 35], "d25": 31, "d2h": 26, "d__avx512f__": 0, "d__avx__": 0, "d_bia": 8, "d_candidate_cel": 8, "d_elu": 8, "d_gate": 8, "d_gate_weight": 8, "d_gates_": 8, "d_input": 8, "d_input_g": 8, "d_new_cel": 8, "d_old_cel": 8, "d_old_cell_": 8, "d_old_h": 
8, "d_output_g": 8, "d_relu": 8, "d_sigmoid": 8, "d_tanh": 8, "d_tanh_new_cel": 8, "d_weight": 8, "d_x": 8, "dampen": 37, "data": [0, 2, 4, 6, 8, 11, 12, 14, 16, 17, 19, 24, 26, 29, 30, 32, 34, 37], "data_prepare_finish": 18, "data_ptr": 7, "data_typ": 19, "dataload": [4, 6, 9, 16, 20], "dataset": [4, 6, 9, 16, 29], "dataset1": 9, "dataset2": 9, "datatyp": [29, 31, 32], "date": 32, "dcmake_c_compil": 8, "dcmake_cxx_compil": 8, "dcmake_prefix_path": [4, 8], "dcpmm": 31, "dcpu_cap": 0, "dcpu_capability_amx": 0, "dcpu_capability_avx2": 0, "dcpu_capability_avx512": 0, "dcpu_capability_avx512_bf16": 0, "dcpu_capability_avx512_fp16": 0, "dcpu_capability_avx512_vnni": 0, "dcpu_capability_default": 0, "ddp": [2, 5, 9, 32], "ddp_loss": 9, "ddr": 31, "deactiv": 26, "dealloc": 33, "debug": [10, 17, 18, 22, 30], "decid": 28, "declar": [0, 8], "decltyp": 0, "decod": 28, "decompress": 15, "decreas": [2, 26], "dedic": 32, "deep": [6, 7, 9, 11, 13, 15, 16], "deepcopi": 2, "deepspe": [2, 10, 28, 31, 32], "def": [6, 8, 9, 11, 19, 20, 26, 32], "default": [0, 2, 4, 5, 6, 7, 9, 10, 18, 20, 22, 23, 24, 26, 32, 35], "default_weight_observ": [4, 17, 30], "defaultvalu": 10, "defin": [0, 2, 5, 7, 8, 9, 11, 15, 16, 17, 18, 19, 20, 30, 32, 35], "definit": [0, 2, 8], "deinit": 3, "delai": 15, "delayedsc": [2, 15], "deleg": [16, 33], "delimit": 34, "deliv": 17, "deliveri": [33, 34], "demand": 5, "demonstr": [4, 7, 13, 19, 29], "demostr": 24, "dens": 19, "depend": [0, 4, 19, 32], "deploi": [28, 29, 32], "deploy": [2, 4, 29], "deployment_mod": 2, "deprec": [10, 12], "dequant": [14, 32], "desc": 19, "descent": 29, "describ": [5, 6, 11, 19, 29], "descript": [2, 5, 9, 10, 18, 19, 25, 35], "descriptor": 32, "design": [3, 9, 11, 13, 14, 19, 29, 30, 32, 33, 35], "desir": [4, 17], "destroy_process_group": 9, "detach": 37, "detail": [0, 2, 3, 4, 5, 6, 8, 10, 11, 12, 13, 17, 19, 24, 25, 28, 29, 31, 32, 35], "detect": [0, 1, 4], "determin": [0, 29], "dev": 20, "dev_p_0": 20, "develop": [1, 4, 8, 26, 33, 34], "devic": [1, 2, 4, 5, 6, 7, 8, 9, 10, 13, 14, 16, 18, 19, 21, 23, 25, 26, 28, 29, 30, 32, 33, 34, 35], "device_count": [9, 20], "device_id": [6, 7, 9, 20], "device_map": 29, "devid": 16, "diagram": 19, "dict": 2, "dictionari": 2, "differ": [0, 1, 2, 4, 6, 7, 10, 19, 20, 28, 29], "difficult": 19, "digit": 9, "dim": [4, 7, 8, 9, 13, 19], "dimens": [2, 8, 13, 19, 35], "dimension": 8, "dir": [0, 10], "direct": 3, "directli": [2, 8, 23, 29], "directori": [1, 3, 8, 10, 26, 30, 32], "disabl": [2, 9, 18, 26, 35], "disable_auto_channels_last": [12, 35], "disable_simple_trac": 22, "disadvantag": [33, 34], "discret": [1, 25, 32], "discrete gpu": 1, "discuss": [3, 8, 19], "dispatch": [1, 8, 32], "displai": 2, "dist": [6, 9, 11], "distinguish": 22, "distribut": [2, 6, 9, 10, 16, 26, 32, 33, 34], "distributeddataparallel": [9, 32], "distributedoptim": 16, "distributedsampl": [9, 16], "div": [19, 32], "divid": [26, 35], "divis": 13, "dkm": 26, "dldevicetyp": 7, "dlmanagedtensor": 7, "dmesg": 26, "dmlc": 7, "dnn": 15, "do": [2, 3, 5, 8, 11, 19, 20, 26, 28, 29], "doc": [1, 3, 17, 30, 32, 35], "dockerfil": 32, "docstr": 3, "document": [0, 4, 5, 8, 10, 30, 32], "doe": [7, 8, 18, 19, 26, 32, 35], "doesn": [2, 19, 20], "domain": [7, 15], "domin": [1, 28], "don": [0, 3, 8, 11, 20], "done": [0, 4, 9], "dongsoo": 29, "dot": [11, 19, 28], "doubl": [0, 8], "down": [2, 26], "download": [4, 9, 20, 23, 26], "downstream": 11, "dpc": [1, 7, 10, 26, 32], "dpccp": 2, "dpcpp": [8, 23, 26, 32], "dpcppbuildextens": 8, "dpcppextens": 8, "dpcpproot": 23, 
"drawback": 2, "drive": [1, 28], "driver": [5, 26, 31, 34], "dropout": [2, 9, 32, 33, 35], "dropout1": 9, "dropout2": 9, "dspevd": 32, "dst": 0, "dtype": [0, 2, 4, 11, 13, 17, 20, 23, 24, 29, 30, 32, 35], "due": [0, 1, 11, 17, 26, 28, 29, 32], "dummi": 10, "dump": [17, 32], "durat": 26, "dure": [2, 3, 4, 28, 29, 34, 35], "dynam": [1, 4, 15, 26], "e": [0, 1, 4, 8, 10, 11, 17, 19, 25, 26, 28, 32, 33, 34], "e4m3": 15, "e5m2": 15, "each": [0, 2, 5, 6, 8, 9, 10, 11, 13, 16, 18, 20, 22, 35], "eager": [1, 17, 32], "earlier": 26, "eas": [8, 19], "easi": [1, 16, 25, 32], "easier": [19, 23], "easiest": 29, "easili": 29, "ec33277": 31, "ecc": 31, "ecolog": 14, "edit": 3, "effect": [0, 16, 17, 29], "effici": [1, 6, 8, 9, 15, 23, 28, 29, 32, 37], "effort": [20, 29], "either": [2, 5, 6], "elabor": 8, "elaps": 9, "elapsed_tim": 9, "element": [10, 19, 37], "eleutherai": 28, "elia": 29, "elimin": 28, "els": [0, 6, 19, 20, 37], "elu": [8, 32], "embed": [2, 28, 32], "emerg": [1, 28], "emit": 8, "empir": 13, "empow": [5, 23], "empti": [7, 18, 19, 22], "empty_cach": [2, 36], "empty_strid": 22, "emul": 32, "enabl": [1, 2, 4, 5, 6, 10, 11, 15, 17, 19, 20, 24, 28, 29, 32, 33, 34, 35], "enable_auto_channels_last": [12, 35], "enable_simple_trac": 22, "enable_tim": 9, "enable_wrap": 9, "encount": [26, 33, 34], "end": [2, 4, 18, 20, 22, 26, 29, 32, 33, 34, 35], "endif": 0, "endl": 4, "enforc": 10, "engin": [1, 4, 19, 25, 26, 29, 32], "enhanc": [1, 23, 28, 29, 32], "enough": [2, 26], "ensur": [4, 16, 26], "entir": [8, 28], "enum": 2, "enumer": [2, 4, 7, 9, 16], "env": [6, 16, 20, 23], "env_key1": 3, "env_key2": 3, "env_val1": 3, "env_val2": 3, "environ": [0, 3, 5, 6, 9, 10, 16, 18, 24, 26, 28], "epoch": [6, 9, 16], "eq": [9, 32], "equal": [10, 26], "equival": [8, 29, 32, 37], "err": 18, "error": [2, 3, 4, 8, 9, 18, 19, 26, 29, 32], "especi": [2, 3, 8], "essenti": 8, "et": 29, "etc": [0, 2, 3, 5, 14, 18, 24, 26], "eval": [2, 4, 9, 11, 17, 24, 30, 32], "evalu": [2, 24, 32], "even": [2, 32], "event": [2, 9], "event_id": 18, "ever": 23, "everi": [6, 22, 28], "exactli": [6, 8], "examin": [20, 37], "exampl": [2, 3, 5, 10, 11, 13, 16, 18, 19, 20, 22, 24, 25, 28, 29, 30, 32, 37], "example_ddp": 6, "example_input": [17, 30], "except": [2, 6, 28], "exclud": 20, "exclus": [6, 9, 10], "execut": [0, 2, 4, 5, 7, 8, 10, 11, 13, 18, 20, 21, 26, 32, 33, 34, 35, 37], "exist": [1, 3, 17, 20, 26, 29, 32], "exit": [26, 32], "exmapl": 22, "exp": [8, 32], "expand": 28, "expect": [19, 26], "expens": 19, "experi": [3, 13, 19, 28, 29, 32], "experiment": 3, "explain": [0, 19, 29], "explicit": [6, 22], "explicitli": [2, 4, 8, 10, 11, 20], "explor": 29, "expon": 15, "export": [10, 18, 22, 24, 26, 32], "export_chrome_trac": 20, "export_compressed_model": 29, "expos": [5, 11], "express": 19, "ext": 4, "ext_modul": 8, "extend": [1, 5, 7, 23, 25, 26, 28, 29, 32], "extens": [2, 4, 7, 9, 10, 12, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 37], "extern": 7, "extra": [2, 6, 32], "extract": 7, "f": [3, 4, 9, 16], "f16c": 0, "f32": [0, 19], "facebook": 28, "facilit": 23, "fact": [8, 19], "fail": [2, 26, 32, 35], "failur": [26, 32], "falcon": 2, "fall": 13, "fallback": 32, "fals": [0, 2, 4, 9, 11, 17, 19, 20, 22, 24, 29, 30], "famili": [2, 28, 32], "familiar": [2, 8], "far": 7, "fast": [2, 8, 16, 29, 33], "faster": 11, "fastest": 0, "fatal": [26, 32], "fatal_error": [4, 8], "fault": 32, "fc1": 9, "fc2": 9, "feasibl": [5, 9], "featur": [1, 3, 4, 10, 11, 14, 19, 23, 25, 26, 32, 33, 34, 35], "feb": 32, "fed": 6, "feed": [2, 12, 19], 
"feedforward": 28, "few": [3, 12, 19], "fft": 32, "fft_fft": 11, "fft_fft2": 11, "fft_fftn": 11, "fft_hfft": 11, "fft_ifft": 11, "fft_ifft2": 11, "fft_ifftn": 11, "fft_ihfft": 11, "fft_irfft": 11, "fft_irfft2": 11, "fft_irfftn": 11, "fft_rfft": 11, "fft_rfft2": 11, "fft_rfftn": 11, "fi": 26, "field": [5, 20, 22], "figur": [1, 7, 28], "file": [0, 2, 3, 4, 8, 10, 11, 18, 19, 20, 26, 32, 34], "filenam": 3, "fill": 8, "filter": 7, "final": [9, 33, 34], "find": [1, 4, 7, 8, 20, 26, 31, 32, 33], "find_packag": [4, 8], "findavx": 0, "fine": [8, 29, 32], "finer": 1, "finish": [4, 13, 18, 26], "first": [3, 4, 8, 12, 16, 17, 20, 29, 32], "firstli": [23, 28, 29], "fit": [3, 29, 32, 33], "five": 18, "fix": [2, 3, 32], "flag": [0, 20, 35], "flagship": [5, 23], "flash": 32, "flatten": 9, "flavor": 8, "flex": [5, 26, 32, 34], "float": [0, 2, 4, 5, 8, 9, 11, 15, 35], "float16": [2, 5, 23, 24, 29, 30, 32], "float32": [2, 20, 24], "float64": 11, "flow": 7, "flush": 2, "fly": 8, "fma": 0, "fmax": 8, "fmin": 8, "fmt": 18, "fn_type": 0, "focu": [2, 19, 28, 29, 30, 32], "focus": 32, "fold": 2, "folder": 3, "follow": [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 26, 27, 28, 30, 32, 35], "footprint": [5, 9, 15, 28, 32], "forc": 32, "forget": 20, "fork": 0, "format": [2, 3, 5, 6, 7, 9, 12, 13, 15, 16, 18, 20, 22, 29, 32, 35], "format_tag": 19, "former": [4, 8], "formerli": [5, 6, 9], "forth": 16, "forward": [2, 4, 6, 8, 9, 11, 13, 19, 35], "found": [1, 4, 17, 19, 30, 32], "foundat": 19, "four": 18, "fp16": [0, 13, 14, 24, 28, 29, 31, 32], "fp32": [0, 2, 10, 13, 14, 17, 24, 28, 32, 37], "fp32_math_mod": 2, "fp32_math_mode_max": 2, "fp32_math_mode_min": 2, "fp32mathmod": 2, "fp64": [10, 26, 32], "fp8": [2, 5, 32], "fp8_autocast": [2, 15], "fp8_group": 2, "fp8_model": 15, "fp8_recip": [2, 15], "fp8linear": 15, "fpmath": 2, "fpmath_mod": 2, "fragment": 2, "framework": [5, 7, 10, 16, 32], "frantar": 29, "free": [17, 18], "freed": [2, 36], "freez": [4, 11, 23, 24], "frequenc": 31, "friendli": 26, "frobenius_norm": 11, "from": [0, 1, 2, 3, 4, 5, 7, 8, 9, 11, 15, 16, 17, 18, 19, 20, 21, 22, 26, 28, 29, 30, 32, 33, 34, 35, 36], "from_blob": 4, "from_dlpack": 7, "from_pretrain": [4, 29, 32], "front": 29, "frontend": [1, 2, 5, 28, 29, 32], "fsdp": 32, "fsdp_main": 9, "fsdp_mnist_xpu": 9, "fsycl": [4, 8, 34], "full": [3, 8, 23, 28, 32], "fulli": [0, 3, 17, 30, 32], "fully_sharded_data_parallel": 9, "fullyshardeddataparallel": 9, "function": [0, 2, 3, 4, 5, 8, 9, 11, 17, 18, 20, 22, 24, 26, 28, 29, 30, 32, 37], "functool": 9, "further": [1, 2, 4, 5, 18, 19, 28, 29], "fuse": [2, 28, 29, 32, 33, 35, 37], "fuse_update_step": 2, "fusion": [1, 2, 4, 17, 23, 30, 32, 33, 35], "futur": [3, 12], "fw": 31, "fwd": 32, "g": [0, 10, 11, 17, 19, 26, 28, 32, 33, 34], "gain": [1, 28], "gamma": 9, "gate": 8, "gate_weight": 8, "gates_row": 8, "gcc": [0, 32], "gdb": 22, "ge": 32, "geglu": 14, "gelu": 32, "gemm": [19, 28, 32], "genai": [1, 28], "gener": [0, 1, 3, 4, 6, 7, 8, 13, 17, 18, 19, 28, 29, 30, 32, 34, 35], "get": [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 18, 20, 28, 32], "get_cpp_typesize_and_vecs": 0, "get_cpp_typesize_and_vecsize_kernel_fn": 0, "get_cpp_typesize_and_vecsize_kernel_impl": 0, "get_cpp_typesize_and_vecsize_kernel_stub": 0, "get_devic": 7, "get_fp32_math_mod": 2, "get_group": 8, "get_group_rang": 8, "get_local_id": 8, "get_log_compon": 18, "get_log_level": 18, "get_log_output_file_path": 18, "get_log_rotate_file_s": 18, "get_log_split_file_s": 18, "get_rank": 6, "get_world_s": 6, "getcurrentxpustream": 
[4, 8], "getveclength": 0, "getveclengthkrnl": 0, "girl": 29, "git": 3, "github": [1, 3, 6, 7, 9, 11], "given": [2, 28], "glibcxx": 32, "glibcxx_use_cxx11_abi": 26, "global": 6, "gnu": 0, "go": [2, 3, 4, 8, 11, 33, 34], "goal": 16, "goe": 8, "good": [1, 2, 3, 19, 28, 37], "googl": 3, "gpt": [2, 28, 29, 31, 32], "gptq": [2, 29], "gpu": [1, 2, 3, 4, 8, 10, 13, 14, 15, 16, 20, 21, 24, 25, 26, 31, 34, 35, 36], "grad": [32, 37], "grad_cel": 8, "grad_h": 8, "grade": [5, 9, 32], "gradient": [5, 9, 15, 16, 29], "grain": 1, "graph": [1, 2, 5, 11, 17, 23, 32, 35], "graph_for": [17, 30], "graph_mod": 2, "graphic": [26, 29, 32, 34], "greater": [13, 26], "grid": 8, "grid_sampl": 11, "group": [2, 6, 8, 9], "group_siz": 29, "gru_cel": 11, "gt": 32, "guarante": 13, "guard": 16, "guid": [0, 6], "guidanc": 5, "guidelin": 19, "gunho": 29, "h": [3, 4, 8, 18, 19, 29, 32], "h2d": 26, "ha": [0, 1, 2, 4, 5, 7, 8, 13, 19, 23, 26, 29, 32, 34], "had": 26, "half": [0, 2], "hand": 8, "handcraft": 29, "handl": [2, 4, 7, 19, 32], "handler": [8, 20], "handwritten": 9, "har": [5, 23], "hard": [7, 19], "hardsigmoid": 32, "hardswish": 32, "hardtanh": 32, "hardwar": [0, 1, 5, 25, 28, 32], "has_2d_block_arrai": 29, "hav": 0, "have": [0, 1, 2, 3, 4, 6, 7, 8, 12, 13, 17, 19, 20, 22, 24, 26, 27, 28, 29, 34], "hbm": 32, "he": 29, "header": [0, 8], "heavier": 28, "height": 19, "held": 2, "help": [0, 2, 3, 4, 8, 9, 13, 28, 33, 34, 36], "helper": 8, "henc": [17, 32], "here": [0, 3, 6, 7, 8, 9, 11, 18, 19, 20, 22], "heurist": 2, "hf": 28, "hgemm_bias_wint4_arc": 29, "hgemm_int4_common_dispatch": 29, "hgemmxetla_int4": 29, "hidden": [19, 28], "hierarchi": 5, "high": [8, 28, 29, 32, 37], "higher": [0, 19, 28], "highest": 13, "highli": [8, 13, 17, 24, 28, 29, 32], "hinge_embedding_loss": 11, "histor": 2, "histori": 28, "hmem": 26, "hold": [6, 9, 19], "holder": [6, 18], "home": [8, 16, 26], "hook": 29, "horovod": [5, 26, 32], "host": [7, 10, 20, 26, 29, 31, 32], "hour": 26, "how": [0, 1, 4, 5, 6, 7, 8, 18, 19, 24], "howev": [2, 3, 5, 8, 10, 11, 12, 19, 28, 29, 32, 36], "hpp": 4, "html": [3, 7, 8], "http": [3, 6, 7, 8, 9, 20, 23, 26], "hub": [28, 29], "huber_loss": 11, "huggingfac": [28, 29], "human": 2, "hvd": [16, 26], "hw": [19, 34], "hwc": 19, "hwio": 19, "hwn": 19, "hypeparamet": 29, "hyper": 31, "hyperparamet": 29, "i": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37], "icpx": [4, 8], "icx": [4, 8], "id": [6, 7, 9, 22, 28], "ideal": 13, "ideep": [0, 19], "ident": [2, 4, 19, 33, 35], "identif": [0, 4], "identifi": 18, "idx": 28, "ifwi": 31, "ignor": 35, "illustr": [5, 6, 9, 17, 19], "imag": [11, 19, 32], "immintrin": 0, "impact": 2, "imper": 5, "impl": [0, 8, 32], "implement": [1, 3, 4, 5, 6, 7, 8, 9, 19, 28, 29, 32, 37], "implicit": [2, 32], "import": [0, 1, 2, 3, 4, 6, 8, 9, 15, 16, 17, 19, 20, 22, 23, 24, 26, 28, 29, 30, 32], "importerror": [6, 26, 32], "impos": 29, "impress": 29, "improp": 26, "improv": [1, 11, 15, 28, 29, 32, 35], "inact": 2, "inactive_split": 2, "inactive_split_byt": 2, "inc": [0, 28], "inc_model": 29, "includ": [0, 1, 2, 4, 8, 10, 14, 20, 24, 26, 27, 28, 29, 31, 32, 33, 35], "include_dir": 8, "include_path": 8, "incorrectli": [26, 32], "increas": [1, 2, 16, 20, 26, 28, 29, 32, 33, 34, 36], "increment": 8, "inde": [8, 26], "indent": [20, 22], "index": [3, 6, 7, 8, 9, 19, 22, 23, 28, 32], "index_put": 11, "indic": 19, "indirect": 28, "individu": [2, 3], "inductor": [5, 23, 32], "industri": [5, 9, 32], "ineffici": 8, "infer": [2, 5, 14, 
15, 17, 19, 23, 24, 32, 35], "inferenc": 2, "inference_data": [17, 30], "inference_dta": [17, 30], "influenc": 28, "info": [0, 4, 17, 18], "inform": [0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 17, 18, 19, 20, 24, 28, 29, 32, 33], "ingredi": 19, "init": [3, 16], "init_end_ev": 9, "init_process_group": [6, 9], "init_start_ev": 9, "initi": [6, 9, 16, 26], "inlin": 0, "inp": 2, "inplac": [2, 17, 19, 29, 30], "input": [0, 2, 4, 6, 8, 9, 12, 13, 17, 18, 19, 20, 23, 29, 31, 32], "input_g": 8, "input_id": [15, 29], "input_mask": 15, "input_ptr": 4, "input_tensor": 20, "input_xpu": 19, "insert": [4, 17, 30], "insid": [3, 8, 18, 33, 35], "instal": [3, 4, 9, 10, 20, 23, 24, 25, 26, 28, 32, 34], "instanc": 35, "instanti": 8, "instead": [10, 21, 26, 30, 32, 37], "instruct": [0, 1, 3, 4, 24, 25, 26, 28, 29, 32], "int": [0, 2, 4, 6, 8, 9], "int32": [2, 32], "int4": [2, 29, 32], "int4_fullrang": 29, "int8": [0, 1, 5, 17, 29, 30, 32], "integ": [2, 29], "integar": 18, "integr": [14, 26, 32, 34], "intel": [2, 5, 7, 8, 9, 10, 12, 13, 16, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 30, 32, 33, 34, 35, 37], "intel discrete gpu": 1, "intel optim": 1, "intel64": [4, 26, 32], "intel64_lin": 4, "intel_extension_for_pytorch": [0, 1, 2, 3, 4, 6, 7, 8, 9, 15, 16, 17, 19, 20, 22, 23, 24, 26, 29, 30, 32], "intel_extension_for_transform": 29, "intelllvm": 4, "intel\u00ae extension for pytorch*": 1, "intend": 3, "intens": 23, "intent": 3, "interest": 3, "interfac": [3, 4, 8, 19, 28, 29], "intern": [0, 2, 10, 19], "interopar": 5, "interoper": 7, "interpret": [2, 8], "intervent": 11, "intrins": 0, "introduc": [1, 8, 19, 32], "introduct": 28, "introductori": 8, "intuit": 29, "invalid": [19, 26, 32], "investig": 26, "invoc": [1, 26, 32], "invok": [2, 4, 8, 11, 24, 26, 30, 32], "involv": 20, "io": 7, "iostream": 4, "ipex": [0, 1, 2, 4, 5, 8, 12, 18, 22, 24, 26, 28, 29, 30, 32], "ipex_declare_dispatch": 0, "ipex_define_dispatch": 0, "ipex_event_log": 18, "ipex_fp32_math_mod": 10, "ipex_fused_optimizer_list_xpu": 35, "ipex_gpu_root_dir": 10, "ipex_info_event_end": 18, "ipex_info_log": 18, "ipex_log": 10, "ipex_log_compon": [10, 18], "ipex_log_level": [10, 18], "ipex_log_output": [10, 18], "ipex_log_rotate_s": [10, 18], "ipex_log_split_s": [10, 18], "ipex_op_regist": 29, "ipex_register_dispatch": 0, "ipex_simple_trac": [10, 22], "ipex_verbos": 10, "ipex_weight_convert_module_xpu": 35, "ipex_xpu_sync_mod": 10, "ipex_xxx_event_end": 18, "ipex_xxx_log": 18, "ipextransformerattnoptimizedint4": 29, "ipextransformerlinear": 29, "ipextransformermlpoptimizedint4": 29, "irc_na": 20, "irfft": 32, "is_contigu": [8, 19], "is_contiguous_channels_last_1d": 19, "is_xpu": 8, "isa": 1, "isa_codegen": 0, "isa_nam": 0, "isacodegen": 0, "issu": [1, 3, 11, 28, 29], "item": [8, 9, 16], "iter": [2, 6, 20, 28], "its": [0, 4, 6, 7, 10, 11, 13, 14, 20, 29], "itself": [2, 19, 20], "itt": 10, "ivalu": 4, "j": [0, 2, 28, 29, 31, 32], "ji": 29, "jit": [1, 2, 4, 11, 17, 19, 24, 30, 32, 33, 34], "job": [3, 26], "join": 9, "json": [2, 20], "jung": 29, "jupyt": 3, "just": [2, 8, 29, 30, 32, 33, 34], "k": [2, 29], "kcpu": 0, "kdloneapi": 7, "kdlsycl": 7, "keep": [3, 19], "keepdim": [8, 9], "kei": [2, 5, 28, 32, 33], "kept": 28, "kera": 16, "kernel": [1, 2, 5, 8, 10, 13, 18, 20, 23, 26, 28, 29, 31, 32, 35], "kernel_s": 19, "key_averag": 20, "kfloat": 4, "kfn": 8, "kill": 26, "killer": 26, "kim": 29, "kineto": [21, 32], "kl_div": 11, "kmd": 26, "knob": 2, "know": [3, 7, 8, 33, 34], "knowledg": 29, "known": [5, 6, 9, 28], "kv": 32, "kv_cach": 28, "kwarg": 20, "kwon": 29, "kxpu": 4, 
"l": 6, "l1_loss": 11, "l_intel": 20, "label": [6, 11, 15, 26], "lamb": 37, "lambda": 8, "landscap": [1, 28], "languag": [1, 8, 29, 30, 32], "lapack": 32, "lar": 37, "larg": [1, 2, 5, 9, 18, 26, 29, 30, 32, 37], "large_pool": 2, "last": [8, 10, 32, 33], "latenc": [28, 32], "later": 29, "latest": [1, 6, 7, 16, 20, 23, 25, 26, 28, 31], "latter": 8, "launch": [8, 10, 18, 28, 32], "launcher": 6, "layer": [28, 32, 33, 35], "layernorm": [13, 14, 32], "layout": [2, 5], "ld": [26, 32], "ld_preload": [26, 32], "ldd": 4, "le": 32, "lead": 28, "leak": [10, 32], "leaki": 32, "leaky_relu": 32, "learn": [6, 7, 9, 11, 15, 16], "lee": 29, "left": 28, "len": [0, 4, 9, 16, 20], "length": [3, 31], "less": [2, 5, 10, 11, 26, 32], "let": [8, 19, 22, 37], "level": [2, 5, 8, 10, 19, 28, 29, 32, 34], "leverag": [1, 23], "lib": [4, 10, 18, 26, 32], "libc10": 4, "libfabr": 26, "libimf": 4, "libintel": 4, "libintlc": 4, "libirng": 4, "libmkl": [26, 32], "libmkl_cor": [4, 26, 32], "libmkl_gnu_thread": [4, 26], "libmkl_intel_ilp64": [26, 32], "libmkl_intel_lp64": [4, 26, 32], "libmkl_sequenti": 32, "libmkl_sycl": [4, 26, 32], "libmkl_vml_avx512": 26, "libopencl": 4, "libpytorch": 4, "libpytorch_path": 4, "librari": [0, 1, 4, 5, 6, 7, 8, 9, 10, 13, 14, 18, 20, 22, 32], "libstdc": 26, "libsvml": 4, "libsycl": 4, "libtorch": [4, 32], "libtorch_cpu": 4, "licens": 0, "lifecycl": [33, 34], "lighter": 11, "like": [1, 2, 3, 8, 17, 18, 20, 26, 28, 29, 32], "limit": [3, 11, 19, 28, 29, 32], "lin": 29, "linalg_multi_dot": 11, "line": [3, 8, 19, 20, 22], "linear": [2, 6, 9, 11, 13, 15, 19, 32, 33, 35], "linear_bn": 2, "linear_bn_fold": 2, "linear_unari": 17, "link": [0, 1, 4], "linker": [26, 32], "linux": [0, 4, 8, 26, 32], "list": [3, 4, 10, 11, 19, 20, 25, 30, 32, 34], "littl": 29, "live": 3, "ll": [3, 8, 20, 22], "llama": [2, 28, 32], "llama2": [28, 31], "llama3": 28, "llm": [1, 2, 26, 30, 32], "lltm": 8, "lltm_backward": 8, "lltm_backward_xpu": 8, "lltm_forward": 8, "lltm_forward_xpu": 8, "lltm_xpu": 8, "lltm_xpu_backward": 8, "lltm_xpu_backward_kernel": 8, "lltm_xpu_forward": 8, "lltm_xpu_forward_kernel": 8, "lltm_xpu_kernel": 8, "lmkl_core": [26, 32], "lmkl_intel_ilp64": [26, 32], "lmkl_sycl": [26, 32], "lmkl_tbb_thread": [26, 32], "load": [0, 1, 2, 4, 8, 20, 26, 32, 35], "load_in_4bit": 29, "load_state_dict": 2, "loaded_model": 29, "loader": 6, "local": [6, 9, 16], "local_rank": [6, 9, 16], "localhost": 9, "locat": [0, 3, 23], "log": [4, 5, 9, 10, 22, 32], "log_compon": 18, "log_interv": 16, "log_level": 18, "log_path": 18, "log_sigmoid": 32, "log_softmax": [9, 11], "logic": [9, 19, 22, 26], "logutil": 18, "long": [8, 19, 28], "look": [3, 4, 8, 19, 29], "loop": [2, 3, 10, 20], "lora": [28, 32], "loss": [2, 4, 6, 9, 11, 16, 19, 23, 28, 29], "loss_fn": 6, "loss_funct": 23, "lot": [26, 28, 32, 33, 34], "low": [5, 6, 8, 24, 32, 37], "low_precision_checkpoint": 2, "lower": [0, 11, 17, 28, 29, 32], "lp64": 32, "lr": [4, 6, 9, 11, 23, 37], "lr_schedul": 9, "lsb": 0, "lstm": [2, 13], "lt": [31, 32], "lv": 29, "m": [6, 8, 9, 16], "m150": 34, "machin": [0, 3, 6, 20, 29], "macro": [0, 5, 18], "made": [3, 7, 32], "mai": [0, 1, 2, 3, 7, 8, 11, 12, 13, 17, 19, 20, 26, 29, 32], "main": [1, 3, 4, 9, 23, 28, 29], "main_work": 6, "maintain": [5, 6, 8, 9, 11], "major": 29, "make": [0, 2, 3, 4, 5, 6, 8, 9, 16, 20, 23, 24, 28, 29, 32, 34], "make_tupl": 0, "malloc_devic": 4, "mamx": 0, "manag": [6, 8, 11, 28], "mani": [3, 5, 8, 20, 32], "manipul": 19, "mantissa": 15, "manual": [2, 5, 13, 19, 35], "manual_se": [6, 9], "map": 19, 
"margin_ranking_loss": 11, "mark": [4, 18, 20, 29], "mask": [0, 32], "mask_valu": 0, "masked_lm_label": 15, "master": [2, 6, 33, 35], "master_addr": [6, 9], "master_port": [6, 9], "match": [0, 2, 11, 35], "math": [2, 5, 8, 10, 13], "matmul": [11, 14, 29, 32], "matric": [8, 29], "matrix": [1, 23, 25, 32], "mavx2": 0, "mavx512bf16": 0, "mavx512bw": 0, "mavx512dq": 0, "mavx512f": 0, "mavx512fp16": 0, "mavx512vl": 0, "mavx512vnni": 0, "max": [0, 9, 14, 26, 29, 31, 32, 34], "max_job": 26, "max_memory_alloc": [2, 36], "max_memory_reserv": [2, 36], "max_pool2d": 9, "maximum": [0, 2], "maxpool1d": 19, "maxpool2d": 13, "maxpool3d": 13, "mb": [10, 18], "md": [3, 10, 19], "me": 19, "mean": [0, 2, 10, 19, 22, 26, 28, 29, 32], "measur": 2, "mechan": [0, 1, 8, 32], "meet": [5, 15, 35], "meltdown": 31, "memori": [4, 5, 7, 8, 9, 11, 12, 15, 18, 26, 28, 29, 31, 32, 35, 37], "memory_alloc": [2, 36], "memory_format": [4, 5, 19], "memory_reserv": [2, 36], "memory_snapshot": [2, 36], "memory_stat": [2, 36], "memory_stats_as_nested_dict": 2, "memory_summari": 2, "mention": [6, 8, 18], "merg": 32, "merit": 19, "mermori": 2, "messag": [8, 10, 18, 19, 22, 26, 32], "met": 35, "meta": [19, 28], "metavar": 9, "method": [2, 7, 8, 11, 20, 22, 28, 29, 32], "methodologi": [2, 4, 5, 8, 37], "metric": 2, "mfma": 0, "microsoft": 28, "might": [2, 19, 26, 37], "min": [29, 32], "min_num_param": 9, "mind": 19, "mini": [28, 32], "minim": [0, 10, 28, 29], "minimum": 19, "minmax": 29, "minmaxobserv": [4, 17, 30], "minor": [6, 32], "mish": 32, "miss": [3, 26], "mitig": [29, 31], "mix": [2, 4, 8, 28, 32], "mixtur": 11, "mkdir": 4, "mkl": [4, 26, 32], "mkl_dpcpp_root": [26, 32], "mkl_lapack_dspevd": 26, "mkldnn": 19, "mkldnn_util": 19, "mlp": [14, 29], "mm": [8, 11], "mm_bias_int4": 29, "mm_qkv_out_int4": 29, "mm_silu_mul_int4": 29, "mmx": 0, "mnist": 9, "mnist_cnn": 9, "mno": 0, "mode": [1, 2, 3, 5, 10, 19, 24, 26, 32], "model": [1, 2, 5, 6, 9, 10, 11, 12, 13, 15, 16, 17, 23, 24, 26, 30, 31, 32, 35], "model_nam": 29, "model_name_or_path": [30, 32], "model_state_dict": 4, "modelimp": [17, 30], "modeljit": [4, 17, 30], "modif": [6, 9, 16, 17], "modifi": [2, 3, 16], "modul": [0, 1, 2, 4, 5, 6, 8, 9, 11, 14, 17, 19, 26, 29, 30, 32, 33, 35], "modular": [2, 4], "moe": 14, "momentum": [4, 23, 37], "momentum_buffer_list": 37, "monitor": [7, 36], "more": [0, 1, 2, 3, 5, 6, 8, 9, 10, 11, 17, 20, 22, 24, 26, 28, 29, 32, 33, 34, 36, 37], "moreov": [1, 28, 32], "most": [5, 10, 13, 17, 26, 28, 32], "motiv": [2, 4], "move": [4, 17, 19, 24], "mp": 9, "mpi": [16, 26], "mpi_rank": 6, "mpi_world_s": 6, "mpirun": 6, "mse_loss": 11, "mseloss": 6, "mtl": 32, "much": [8, 19, 37], "mul": 32, "mul_": 37, "multi": [6, 24, 32, 34], "multi_margin_loss": 11, "multi_process_spawn": 6, "multidimension": 19, "multiheadattent": 28, "multilabel_margin_loss": 11, "multipl": [0, 2, 3, 5, 6, 11, 19, 23, 26, 28, 32, 34], "multiprocess": [6, 9], "munmap_chunk": 26, "must": [0, 3, 8, 20, 26, 34, 35, 37], "mv": 11, "mxnet": 16, "my": 19, "my_auto_wrap_polici": 9, "my_schedul": 20, "mykernel": 0, "n": [2, 4, 6, 8, 9, 19], "n1": 19, "n2": 19, "name": [0, 8, 13, 20, 22, 29], "name1": 22, "name2": 22, "named_paramet": 16, "namespac": [0, 4, 11], "nan": 0, "nativ": [0, 1, 5, 11, 26, 32, 37], "natur": [8, 19, 28, 29], "nb": 19, "nchw": 5, "nd": 19, "nd_item": 8, "nd_rang": 8, "ndim": 7, "ne": 32, "nearest": [19, 29], "necessari": [6, 9, 16, 19, 20, 22, 26], "necessarili": 14, "necessit": 29, "neck": 37, "need": [0, 2, 3, 4, 5, 6, 8, 9, 16, 17, 19, 20, 22, 24, 30, 32, 33, 34, 35, 
37], "neg": 2, "neglig": 19, "neox": 2, "nest": [2, 22], "nesterov": 37, "net": 9, "network": [1, 5, 11, 13, 15, 32], "neural": [1, 5, 13, 15, 32], "new": [0, 3, 15, 19, 29, 30, 32, 33], "new_cel": 8, "new_h": 8, "newer": 1, "newkernel": 0, "newkernelkrnl": 0, "next": [3, 8, 13], "next_sentence_label": 15, "nf4": 29, "nhwc": [5, 32], "nightli": 23, "ninja": 8, "nll_loss": [9, 11, 16], "nll_loss2d": 11, "nll_loss_nd": 11, "nlp": 4, "nn": [2, 4, 6, 9, 11, 13, 19, 33, 35], "no_grad": [4, 9, 23, 24], "node": [31, 32], "non": [2, 3, 10, 11, 19, 29], "noncontigu": 19, "none": [2, 6, 9, 37], "nonzero": 20, "norm": [14, 32], "normal": [1, 4, 9, 16, 20, 28, 29, 33, 35], "note": [0, 2, 3, 6, 7, 8, 9, 12, 14, 16, 19, 26, 28, 29, 32, 34], "noth": 2, "notic": 27, "nov": 32, "now": [2, 5, 8, 19, 23], "nproc": 9, "nuclear_norm": 11, "null": [10, 18], "nullcontext": 20, "num_alloc_retri": 2, "num_oom": 2, "num_replica": [9, 16], "num_work": [6, 9], "number": [1, 2, 3, 4, 6, 8, 9, 16, 18, 20, 22, 26, 31, 32, 37], "numer": [2, 11], "numpi": 7, "nuqmm": 29, "o": [0, 4, 6, 9, 26, 31, 32], "o0": 2, "o1": 2, "o3": 0, "oam": 31, "object": [0, 2, 4, 26, 32], "observ": [4, 12, 17, 30], "obtain": [17, 28, 29, 30], "occasion": 29, "occupi": [2, 36], "occur": [26, 29, 32], "ocl_icd_vendor": [24, 26], "oct": 32, "octob": 2, "oem": 31, "off": [10, 11, 20, 22, 26, 28, 29, 32], "offer": [1, 3, 20, 29, 36], "offici": [3, 17, 20, 32], "offload": 26, "offset": [19, 28], "old_cel": 8, "old_h": 8, "on_trace_readi": 20, "onc": [0, 2, 3, 13, 18, 19, 20, 29, 34, 35], "one": [2, 3, 6, 7, 8, 10, 13, 16, 17, 18, 19, 20, 26, 30, 32, 37], "oneapi": [4, 5, 6, 7, 9, 13, 16, 23, 26, 31, 32, 34], "oneapi_root": 6, "oneccl": [5, 9, 10, 32], "oneccl_bind_pt": 6, "oneccl_bindings_for_pytorch": [6, 9], "onednn": [0, 2, 5, 10, 13, 18, 23, 28, 32], "onednn_layout": 13, "onemkl": [10, 13, 18, 26, 32], "ones": [0, 8, 17, 29], "onli": [0, 1, 2, 3, 4, 7, 8, 10, 11, 13, 14, 16, 17, 18, 19, 24, 32, 35], "onlin": [15, 32], "onnx": 29, "onto": 20, "oob": 32, "oom": 26, "op": [3, 9, 10, 13, 18, 22, 23, 28, 29], "open": [1, 26, 32], "opencl": [24, 26, 34], "oper": [1, 2, 4, 5, 7, 8, 10, 11, 17, 18, 20, 21, 22, 23, 29, 32, 33], "opportun": 35, "opportunit": 2, "opt": [0, 2, 6, 23, 28, 31, 32], "optim": [1, 2, 5, 6, 9, 10, 11, 12, 13, 14, 16, 19, 23, 24, 25, 26, 29, 32], "optimize_lstm": 2, "optimize_transform": [2, 32], "optimized_model": 2, "optimized_optim": 2, "optimizer_state_dict": 4, "option": [1, 2, 4, 10, 23, 26, 33, 34, 35], "optioncpu": 10, "optionexperiment": 10, "optiongpu": 10, "order": [0, 2, 7, 13, 15, 19, 22, 26, 32], "ordered_gemm_wint4_config_set_arc": 29, "ordered_gemm_wint4_config_set_pvc": 29, "org": [8, 23], "organ": 19, "origin": [0, 2, 7, 15, 16, 29, 30, 35, 37], "oserror": [26, 32], "other": [0, 2, 5, 7, 9, 11, 13, 14, 15, 16, 17, 18, 19, 24, 26, 28, 29, 32, 36, 37], "otherwis": [2, 9, 10, 29], "our": [3, 4, 8, 13, 17, 28, 29], "out": [2, 4, 7, 8, 11, 20, 22, 26, 29, 32, 37], "outplac": 19, "output": [2, 4, 9, 10, 11, 13, 15, 16, 18, 19, 20, 22, 23, 29], "output_g": 8, "output_tensor": 4, "output_tensor_1": 20, "output_tensor_2": 20, "outsid": 20, "outstand": 3, "over": [2, 3, 5, 8, 10, 11, 12, 19, 32], "overal": 17, "overcom": [28, 29], "overhead": [1, 2, 8, 10, 28, 32, 33, 37], "overrid": 10, "overridden": [0, 2], "overview": [30, 32], "overwrit": 2, "own": [4, 8], "ox": 10, "p": [20, 31], "pack": [2, 7], "packag": [1, 4, 6, 8, 9, 24, 26, 32], "packed_accessor32": 8, "packedtensoraccessor32": 8, "packet": 2, "pad": [11, 19, 
32], "page": [4, 6, 9, 20, 30, 31, 32, 35], "parallel": [2, 6, 26, 28, 32], "parallel_for": 8, "param": [2, 37], "paramet": [0, 2, 4, 5, 6, 9, 11, 16, 18, 20, 23, 28, 29, 32, 35, 37], "parameterwrapp": 35, "parent": 7, "park": 29, "pars": [7, 9, 32], "parse_arg": 9, "parser": 9, "part": [4, 8, 11, 18, 19, 20, 33, 35], "partial": 9, "particular": [3, 11, 26, 29, 30, 32], "particularli": [5, 7], "partit": 16, "pass": [0, 1, 3, 4, 8, 20, 26, 29], "patch": 32, "path": [2, 4, 8, 9, 10, 18, 19, 20, 23, 24, 26, 35], "path_to_your_onemkl": 32, "pattern": [8, 17, 19, 30, 32, 36], "pdist": [11, 32], "peak": 2, "peer": 29, "peft": 32, "per": [2, 7, 8, 16, 17, 31], "per_kernel": 10, "per_tensor_symmetr": [4, 17, 30], "percentag": 20, "percentasg": 20, "perchannel": [17, 30], "perf": 19, "perfetto": 20, "perform": [1, 2, 4, 5, 8, 10, 11, 12, 13, 14, 17, 19, 23, 25, 28, 29, 30, 32, 35, 37], "period": 2, "permut": 32, "permutecontigu": 13, "persist": 10, "perspect": [2, 19], "pertain": 0, "phase": [26, 28], "phi": [28, 32], "pick": 3, "pid": 22, "piec": 22, "pin": 16, "pin_memori": [6, 9], "pip": [3, 6, 16, 23, 29], "pixelshuffl": 32, "pl": 10, "place": [2, 6, 11, 18, 28], "plan": 3, "platform": [4, 7, 10, 19, 23, 26, 29, 32, 34, 35], "platinum": 31, "pleas": [2, 3, 5, 6, 9, 10, 18, 20, 21, 24, 29, 32, 34, 35], "plug": 8, "pmi_rank": 6, "pmi_siz": 6, "point": [2, 5, 7, 8, 11, 15, 29], "pointer": [0, 8, 26], "poisson_nll_loss": 11, "polici": [29, 32], "polymorph": 0, "pool": [2, 33], "popular": [1, 7, 28, 29, 31, 32], "popup": 3, "port": [3, 6], "posit": [28, 29], "possibl": [2, 4, 7, 13], "post": [3, 5, 17, 23, 28, 29], "potenti": [23, 26, 35], "pow": [11, 32], "power": [8, 15], "pr": 19, "practic": [5, 8, 28], "pragma": 0, "pre": [14, 23, 28, 29, 34], "prebuilt": [20, 26, 32, 34], "precis": [4, 6, 15, 24, 29, 31, 32], "pred": 9, "predefin": 2, "prefer": [1, 13], "prefetchw": 0, "prefetchwt1": 0, "prefil": 28, "prefix": 8, "preinstal": 20, "preload": [26, 32], "prelu": 11, "prepack": [2, 19, 28], "prepar": [17, 30], "prepare_data": 18, "prepare_fp8": 15, "prepare_jit": [4, 17, 30], "preprint": 29, "prerequisit": [3, 4], "present": 29, "preserv": [28, 29], "prevent": [16, 37], "previou": [18, 19, 32], "previous": 8, "primari": 32, "primit": [2, 10, 32], "principl": [16, 19], "print": [0, 4, 6, 9, 16, 17, 19, 20, 22, 30], "printout": 2, "prior": [2, 24], "prioriti": 13, "probabl": [7, 9, 26], "problem": [26, 37], "procedur": [26, 29], "process": [4, 5, 6, 8, 9, 13, 15, 16, 22, 26, 28, 29, 35], "processgroup": [2, 6, 9], "processor": [32, 37], "produc": [3, 7, 8, 11, 36], "product": [1, 28, 32, 33, 34], "prof": 20, "profil": [10, 32], "profileact": 20, "profiler_setup": 20, "profileract": 20, "program": [1, 2], "progress": [26, 28], "project": [1, 4, 8], "prompt": [28, 29], "proper": 20, "properti": [4, 8], "propos": [3, 19, 28, 29], "propot": 20, "prototyp": [2, 10, 32, 34], "provid": [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 18, 25, 26, 28, 29, 30, 32, 34, 35, 37], "pseudo": 37, "pt": [2, 4, 9, 29], "pth": 4, "pthread": 4, "pti": [10, 20], "ptr": 32, "public": [8, 32], "publicli": 32, "pull": 3, "purpos": [0, 8], "push_back": 4, "put": [6, 7, 9, 20], "pvc": [31, 34], "py": [3, 6, 8, 9, 10, 26], "pybind11": 8, "pybind11_modul": 8, "pyi": 3, "python": [0, 1, 2, 3, 6, 7, 8, 9, 10, 16, 24, 28, 29, 30, 32, 33, 35], "python_include_dir": 8, "pytorch": [2, 4, 5, 7, 8, 9, 10, 11, 12, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37], "qconfig": [4, 17, 30], "qconfig_summary_fil": 2, "qkv": 
29, "qmodel": 29, "qscheme": [4, 17, 30], "qualiti": [28, 29, 32, 33, 34], "quant": 2, "quantiz": [1, 4, 14, 30, 32], "quantizaiton": 15, "quantizat": 2, "quantization_config": [2, 29], "quantize_jit": [4, 17, 30], "quantwrapp": [17, 30], "queri": [0, 19], "question": 19, "queue": [5, 10, 18], "quick": [1, 25], "quint8": 4, "quit": [0, 29, 32], "qwen": [28, 29, 32], "r": [3, 31], "rais": 6, "ram": 26, "rand": [4, 11, 19, 23], "randint": 4, "randn": [6, 13, 19, 20, 22], "random": [9, 16, 26], "rang": [1, 6, 8, 9, 15, 16, 17, 20, 30], "rank": [5, 6, 9, 16, 32], "ransform": 29, "rate": [9, 16], "rather": [2, 18, 19], "re": [3, 6, 8, 11], "reach": 32, "read": [3, 8, 29, 37], "readabl": [2, 8], "readi": 7, "readm": 3, "real": [2, 4, 6], "realli": [3, 8], "realloc": 18, "reason": 19, "rebas": 3, "reboot": 26, "receiv": [2, 26, 32], "recent": [19, 29], "recip": [2, 32], "reciproc": 11, "recogn": 7, "recommend": [1, 4, 12, 13, 17, 24, 26, 29, 32], "record": [9, 18, 20], "record_avg_pool": 18, "record_shap": 20, "recurs": 3, "reduc": [1, 2, 5, 9, 15, 20, 28, 29, 32, 37], "reduce_rang": [4, 17, 30], "reduceop": 9, "reducescatt": [9, 32], "reduct": 9, "refer": [0, 1, 2, 6, 8, 9, 10, 12, 13, 19, 20, 24, 25, 32, 34, 35], "referenc": 35, "regard": [6, 19], "regardless": 11, "region": [0, 11], "regist": [0, 1, 10, 29, 32], "registrationcent": 20, "registri": 26, "regress": [3, 12], "regular": 4, "reinstal": 3, "reinterpret": 19, "reinterpret_cast": 0, "relat": [0, 7, 9, 17, 18, 20, 26], "relationship": 22, "releas": [0, 1, 2, 6, 12, 14, 19, 26, 34, 36], "reli": [7, 19], "relianc": 29, "reload": 8, "relu": [9, 19, 32], "remain": [29, 37], "remark": [28, 29, 32], "rememb": [20, 23], "remov": [2, 3, 20, 32], "renorm": 11, "reorder": [2, 19, 28], "reorder_cach": 28, "repeat": [19, 20], "repeatedli": 3, "replac": [2, 3, 6, 9, 17, 29, 32, 33, 35], "replace_dropout_with_ident": 2, "replic": 6, "replica": [5, 6, 9], "repo": [3, 6], "repo_url": 6, "report": [0, 1, 18, 26], "repositori": 6, "repres": [3, 6, 18, 32], "represent": 19, "request": [1, 2, 3, 33], "requir": [2, 3, 4, 6, 7, 8, 10, 11, 13, 15, 17, 19, 24, 26, 28, 29, 30, 32, 35], "reserv": [2, 18, 33], "reserved_byt": 2, "reset": [2, 14], "reset_accumulated_memory_stat": 2, "reset_peak_memory_stat": 2, "reset_peak_stat": 2, "reshap": 8, "resid": 6, "residu": 14, "resiz": 4, "resize_": 22, "resnet50": [10, 20], "resnet50_weight": 4, "resolv": [26, 32], "resourc": [28, 29, 32], "respons": [7, 17, 22, 28], "restor": 16, "result": [1, 2, 8, 19], "retak": 14, "retri": 2, "retriev": 8, "return": [0, 2, 4, 6, 8, 9, 11, 19, 20], "return_tensor": 29, "reus": 2, "review": 29, "rf": 3, "rfc": 19, "rh": 0, "right": [8, 24, 28], "rm": [3, 14, 32], "rmsnorm": 28, "root": [0, 4, 6, 23, 26, 28, 32], "root_rank": 16, "rope": 28, "rotari": 28, "rotat": [10, 18], "roughli": 19, "round": [29, 32], "rounding_bia": 0, "row_limit": 20, "rst": 3, "rtn": 29, "run": [2, 3, 4, 5, 6, 8, 9, 10, 11, 20, 22, 24, 26, 32, 33, 34, 35], "run_benchmark_woq": 29, "rune": 6, "runtim": [0, 1, 2, 5, 7, 8, 9, 11, 18, 20, 24, 26, 32, 34], "runtimeerror": 26, "sacrif": 11, "safe": 7, "same": [0, 6, 8, 9, 18, 19, 26, 28, 32], "sampl": [0, 2, 6, 12], "sample_input": [2, 12], "sampler": [6, 9, 16], "sampler1": 9, "sampler2": 9, "sanit": 10, "save": [2, 3, 4, 9, 15, 16, 19, 32, 35], "save_model": 9, "save_pretrain": 29, "saved_dir": 29, "scalar": 32, "scalar_t": 8, "scalartyp": 0, "scalartypetocpptyp": 0, "scale": [2, 5, 9, 10, 15, 16, 26, 28, 29, 32], "scale_dtyp": 29, "scaled_dot_product_attent": 
11, "scatter_add": 11, "scenario": [2, 7, 14, 17, 29, 32], "schedul": [1, 9, 20], "scope": 11, "scratchpad": 10, "screen": 22, "script": [0, 1, 2, 3, 4, 5, 6, 8, 9, 11, 16, 17, 22, 24, 28, 30, 35], "scriptmodul": [17, 30], "sdk": 20, "sdp": 32, "sdpa": [28, 32], "se": 29, "se5c7411": 31, "seamlessli": [5, 23], "search": [1, 3, 28, 32], "sec": 9, "second": [8, 16, 18, 20, 26], "secret": 19, "section": [1, 4, 5, 11, 17, 25, 29, 30, 35], "see": [1, 2, 3, 8, 11, 15, 19, 20, 22, 26, 32, 34], "seed": [6, 9], "seed_numb": 6, "segment": [2, 5, 20, 32], "segment_id": 15, "select": [2, 4, 5, 29, 32], "selector": 7, "self": [6, 9, 11, 19, 20], "self_xpu_time_tot": 20, "semant": 19, "semidefinit": 29, "sep": [0, 10], "separ": [4, 8, 10, 18, 26, 27], "seper": 34, "seq_length": 4, "sequenc": [19, 20, 28], "sequenti": 19, "seri": [5, 14, 26, 29, 32, 34], "serv": 32, "server": [16, 26], "servic": 4, "set": [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 13, 16, 22, 24, 26, 29, 31, 32, 34, 35], "set_devic": [6, 9, 16], "set_epoch": [6, 9], "set_fp32_math_mod": 2, "set_log_compon": 18, "set_log_level": 18, "set_log_output_file_path": 18, "set_log_rotate_file_s": 18, "set_log_split_file_s": 18, "set_properti": [4, 8], "set_source_files_properti": 8, "setup": [3, 8, 9, 16, 28], "setuptool": 5, "setvar": 6, "sever": [2, 18, 20, 26, 31, 37], "sgd": [2, 4, 6, 11, 16, 23, 32, 35, 37], "sgd_fused_step": 37, "sh": [6, 16, 20, 23, 29], "sha": 0, "shall": [3, 19], "shape": [2, 7, 13, 20, 28], "shard": [28, 32], "share": [1, 3, 5, 6, 7, 8, 26, 32], "shen": 29, "ship": 26, "shorten": 3, "should": [2, 3, 8, 9, 11, 18, 22, 26, 28, 29], "should_profil": 20, "show": [0, 4, 6, 7, 8, 11, 20, 22, 28, 29, 30, 31, 32], "showcas": [15, 29], "shown": [1, 2, 4, 6, 19, 20, 22, 28, 29], "shuffl": [6, 9], "si": 31, "side": [5, 7], "sigmoid": [8, 32], "sign": [15, 20, 29], "significantli": [28, 29], "silu": 32, "similar": [0, 9, 20, 26, 32], "simpl": [2, 5, 8, 10, 11, 19, 23, 29, 32], "simplenet": [11, 23], "simpli": [4, 8], "simplic": 29, "simultan": 35, "sinc": [2, 7, 8, 19, 29, 33, 34], "singl": [9, 16, 18, 28, 31, 32, 37], "single_card": 6, "single_card_dist": 6, "situat": 7, "size": [0, 2, 4, 6, 7, 8, 9, 10, 16, 18, 19, 26, 28, 29, 32, 33, 34], "size_based_auto_wrap_polici": 9, "size_t": 8, "sizeof": 0, "skip": [0, 3, 4, 19, 22, 29, 33, 34], "skip_first": 20, "sleef": 0, "slice": [4, 8, 19], "slot": 31, "slow": 26, "slower": 11, "small": [2, 17], "small_pool": 2, "smaller": [0, 11, 32], "smallest": 33, "smooth_l1_loss": 11, "smoothquant": 28, "snapshot": [2, 36], "snippet": [9, 13, 20, 30], "so": [0, 2, 4, 5, 7, 11, 16, 19, 22, 26, 32, 36, 37], "socket": 31, "soft_margin_loss": 11, "softmax": [13, 32], "softplu": 32, "softwar": 27, "solut": [2, 9, 26, 28, 32], "solv": 37, "some": [0, 2, 3, 5, 8, 9, 11, 13, 19, 24, 26], "someth": 19, "sonsumpt": 20, "soon": [10, 21], "sort_bi": 20, "sourc": [0, 1, 3, 8, 10, 16, 20, 22, 23, 26, 27, 34], "space": [19, 29], "spars": 19, "spawn": [6, 9], "spec": 7, "special": [0, 13, 28], "specif": [1, 2, 4, 5, 7, 10, 13, 14, 16, 18, 19, 28, 32, 33, 35], "specifi": [2, 8, 9, 10, 18, 34, 35], "specifii": 0, "spectr": 31, "speed": [5, 8, 28, 32, 33, 37], "speedup": [11, 13, 28, 32], "sphinx": 3, "spir64_gen": 34, "split": [0, 2, 8, 10, 18, 33, 35], "split_master_weight_for_bf16": 2, "spontan": 19, "sqrt": 32, "squar": [28, 32], "src": 0, "src_data_ptr": 19, "src_md": 19, "src_mem": 19, "sse": 0, "sse2": 0, "sse3": 0, "sse4_1": 0, "sse4_2": 0, "ssse3": 0, "stabil": 11, "stabl": [5, 6, 7, 11, 26], "stack": [11, 18, 22], 
"stage": [17, 37], "stai": 29, "standard": [1, 8, 28], "start": [1, 2, 3, 4, 6, 16, 18, 20, 22], "stat": 2, "state": [2, 5, 8, 9, 16, 28, 36], "state_dict": [2, 4, 9, 16], "state_s": 8, "statement": [0, 5, 20], "static": [2, 5, 17, 28, 29, 35], "statist": [2, 4, 17, 30], "statu": 0, "std": [0, 4, 8], "stead": 0, "step": [2, 3, 4, 6, 8, 9, 11, 13, 16, 18, 20, 22, 23, 26, 29, 33, 35], "step2": 13, "step3": 13, "step4": 13, "step_id": 18, "step_num": 20, "step_siz": 9, "steplr": 9, "still": [3, 5, 8, 11, 19, 32, 34], "stock": [3, 7, 19, 32], "stop": 26, "storag": 37, "store": [0, 2, 14, 19, 28, 37], "store_tru": 9, "str": [2, 6, 20], "strategi": 8, "stream": [4, 8, 10], "strict": 4, "stride": [7, 11], "stride_c": 19, "stride_h": 19, "stride_n": 19, "stride_w": 19, "string": [2, 9, 18], "structur": [1, 5, 7], "style": [2, 3, 4, 8], "sub": [18, 32], "sub_fold": 3, "subcompon": 10, "subfold": 0, "subgraph": 2, "subject": [0, 27], "submit": [1, 3, 8], "submit_barri": 10, "submodul": 3, "subsequ": [8, 19, 26], "substitut": 29, "success": 4, "successfulli": 6, "suffici": [5, 10, 29], "suffix": [0, 26, 32], "suggest": [1, 19, 20], "suit": 3, "suitabl": 29, "sum": [8, 9, 19, 32], "summari": 2, "summit": 29, "super": [6, 9, 11, 19], "suppli": [11, 19], "support": [0, 2, 3, 4, 6, 7, 8, 10, 13, 17, 18, 21, 23, 25, 26, 28, 30, 32, 34, 35, 37], "sure": [3, 6, 9, 16, 20], "surgeon": 29, "sw": 31, "swap": [17, 26], "switch": [0, 9, 20], "sycl": [1, 5, 7, 10, 13, 18, 32, 33, 35], "sycl_devic": 19, "sycl_queu": 8, "symbol": [26, 32], "symlink": 3, "symmetr": 17, "sync": 3, "synchron": [7, 8, 10, 16, 20], "syngraph": 18, "sysman": 2, "system": [0, 8, 26, 32], "t": [0, 2, 3, 7, 8, 11, 19, 20, 22, 32], "t2": 7, "t_valu": 0, "tabl": [0, 20, 34], "take": [1, 2, 8, 11, 19, 25, 29, 32], "tanh": [8, 32], "target": [0, 4, 8, 9, 16, 26, 32, 33, 34], "target_include_directori": 8, "target_link_librari": [4, 8], "task": [2, 26, 28, 29], "tdr": 26, "tdrdelai": 26, "team": [1, 3], "techniqu": [1, 2, 8], "technolog": [1, 28], "tell": 19, "templat": [8, 13, 18, 28, 32], "temporari": 8, "tenor": 19, "tensor": [0, 2, 4, 5, 7, 8, 11, 13, 17, 20, 28, 32, 36], "tensordot": 11, "tensorflow": [16, 19], "tensoropt": 4, "teq": 29, "term": [8, 27], "termin": 26, "test": [0, 4, 9, 22, 31, 32, 33, 34], "test_": 3, "test_batch_s": 9, "test_input": 19, "test_input_xpu": 19, "test_kwarg": 9, "test_load": 9, "test_loss": 9, "test_weight_norm": 26, "test_weight_norm_differnt_typ": 26, "testnnmethod": 26, "text": 34, "tf32": [2, 10], "tgi": 32, "than": [0, 2, 10, 13, 14, 17, 18, 19, 20, 23, 26, 29, 32], "thank": 3, "thei": [2, 8, 11, 19, 28, 29, 32], "them": [1, 3, 10, 16, 19, 20, 26, 28, 29, 32, 35, 37], "therefor": [14, 17], "thi": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 16, 17, 18, 19, 20, 22, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37], "thing": 26, "those": [2, 8, 16, 20, 26, 28, 36], "thread": [1, 4, 8, 10, 22, 31], "three": [0, 7, 28], "through": [1, 2, 4, 5, 8, 11, 23, 25, 29, 32], "throughput": [28, 32], "throw": 9, "thrown": [2, 26], "thu": [2, 8, 11, 13, 19, 32], "thudm": 28, "tid": 22, "tile": [0, 5, 6, 28, 31], "time": [0, 2, 3, 5, 8, 9, 18, 19, 20, 21, 26, 28, 29, 37], "timeout": 3, "timestamp": 28, "tip": 0, "tloss": [9, 16], "tmp": [8, 20], "to_channels_last_1d": 19, "to_dens": 19, "to_dlpack": 7, "to_mkldnn": 19, "togeth": 34, "toi": 9, "token": [29, 31], "token_type_id": 15, "tool": 0, "toolkit": [2, 23, 31, 32], "toolset": 0, "top": [32, 35], "toplevel": 3, "topologi": [31, 37], "torch": [1, 2, 6, 7, 8, 9, 10, 11, 13, 
16, 17, 18, 19, 20, 22, 24, 26, 29, 30, 31, 32, 35, 37], "torch_ccl": [5, 6], "torch_check": [0, 5, 8, 18], "torch_error": [5, 18], "torch_extens": 8, "torch_extension_nam": 8, "torch_ipex": [0, 2, 29], "torch_ipex_include_dir": 8, "torch_ipex_librari": [4, 8], "torch_librari": 8, "torch_llm_allreduc": [10, 32], "torchdynamo": 1, "torchinductor": [5, 23], "torchscirpt": 2, "torchscript": [1, 2, 5, 24, 37], "torchvis": [4, 9], "total": [2, 8, 20, 31, 32, 36], "totensor": [4, 9], "tpp": 28, "trace": [1, 4, 5, 10, 11, 17, 18, 24, 30], "trace_": 20, "trace_example_on_multi_devic": 20, "trace_fil": 20, "trace_handl": 20, "track": [1, 2], "trade": [11, 28, 29, 32], "train": [2, 6, 9, 15, 16, 17, 19, 24, 29, 30, 32, 33, 35], "train_dataset": [4, 6, 16], "train_kwarg": 9, "train_load": [4, 6, 9, 11, 16], "train_sampl": [6, 16], "trainabl": 29, "transfer": 26, "transform": [2, 4, 9, 14, 19, 31, 32], "transpar": [30, 32], "transpos": [8, 32], "tree": 3, "tri": 35, "trigger": [9, 17, 26, 30, 32, 34], "triplet_margin_loss": 11, "triton": [23, 32], "true": [0, 2, 4, 6, 8, 9, 13, 15, 17, 23, 24, 29, 30, 35], "trust_remote_cod": 29, "try": [2, 3, 4, 6, 18, 26], "tune": [11, 29, 32], "tupl": [0, 2, 6], "turboboost": 31, "turn": 22, "tutori": [3, 4, 6, 8, 9, 20], "two": [2, 5, 7, 8, 15, 18, 28, 33], "txt": [3, 4, 8], "type": [0, 2, 3, 4, 5, 7, 8, 9, 17, 19, 24, 26, 29, 32, 33, 34, 35], "typenam": 8, "types": 0, "typic": [5, 7, 16, 20, 29, 32], "u": [6, 8], "ubuntu": [26, 31, 32], "ucod": 31, "ui": 20, "uint32_t": 0, "uint4": 2, "uint8": 17, "ultim": 8, "ultra": 32, "unabl": 33, "unalign": 0, "unaryop": 32, "uncas": 4, "undefin": [26, 32], "under": [2, 7, 11, 14, 19, 26, 27, 32], "undergo": 30, "underlai": 8, "underli": [0, 1, 5, 26, 28, 36], "underneath": 32, "understand": 36, "ungracefulli": 26, "unifi": [2, 4], "uniform": 29, "uninstal": 3, "unintention": 22, "uniqu": [18, 20, 22], "unit": [1, 8, 32], "unlik": [4, 5, 9, 28, 29], "unlist": 11, "unload": 26, "unlock": 23, "unoccupi": 2, "unpredict": 2, "unquant": 29, "unstabl": 11, "unsupport": [26, 32], "until": [3, 22], "untrack": 3, "unus": [2, 10, 36], "up": [5, 7, 8, 9, 10, 28, 32, 33], "updat": [2, 3, 6, 9, 32, 33, 35, 37], "upgrad": 32, "uplift": 32, "upon": 29, "upper": 19, "upsampl": [13, 19], "upsampleblinear2d": 13, "upsamplenearest": 13, "upstream": 19, "url": [6, 23], "us": [0, 1, 2, 3, 6, 9, 10, 14, 15, 16, 17, 18, 19, 21, 23, 24, 26, 27, 28, 29, 32, 33, 35, 36], "usag": [2, 5, 7, 11, 13, 17, 19, 20, 24, 25, 32], "use_aot_devlist": [10, 34], "use_channels_last_1d": 10, "use_ds_kernel": 10, "use_itt_annot": 10, "use_llm_runtim": 29, "use_onednn_dir": 10, "use_onemkl": [10, 26, 32], "use_optimum_format": 29, "use_persist_stream": 10, "use_primitive_cach": 10, "use_pti": [10, 20], "use_queue_barri": 10, "use_scratchpad_mod": 10, "use_split_fp64_loop": 10, "use_sycl_assert": 10, "use_xetla": 10, "use_xetla_src": 10, "user": [1, 2, 4, 5, 10, 12, 13, 19, 23, 26, 32, 33, 34, 35, 36], "user_nam": 8, "using_simple_trac": 22, "usm": [4, 7], "usr": [0, 26, 32], "usual": [19, 23, 28, 29], "util": [1, 4, 5, 6, 7, 8, 9, 13, 16, 18, 19, 26, 29, 34, 35], "v": [26, 32], "v0": [7, 32], "v1": 32, "v2": [23, 32], "v3": [23, 32], "v4": 31, "valid": [7, 9, 10, 14], "valu": [0, 2, 4, 10, 15, 26, 28, 29], "vanilla": 8, "var": [6, 16, 20, 23], "variabl": [0, 3, 5, 10, 16, 18, 24, 26], "variant": 11, "variou": [7, 20, 23, 28, 29, 32], "vec256": 0, "vec512": 0, "vec_bia": 0, "vector": [0, 1, 2, 4, 8, 14, 19, 32], "vectors": 0, "vehicl": 32, "vendor": [24, 26], "ver": 
8, "verbos": [5, 8, 10, 18, 22], "veri": [3, 8, 10, 19, 20, 21, 26, 28], "verif": 32, "verifi": [4, 23, 26, 28], "version": [0, 4, 7, 8, 23, 26, 27, 32, 35, 37], "via": [3, 5, 7, 8, 10, 17, 20, 23, 29, 34, 36], "view": [17, 19, 20, 32], "view_a": 9, "viewer": 20, "virtual": 0, "virtualguardimpl": 8, "visibl": [2, 32], "vision": 4, "vllm": 32, "vml": [26, 32], "vnni": [0, 1, 32], "vocab_s": 4, "void": [0, 8], "w": [19, 29], "w4g32": 29, "w8": 29, "w8a8": [28, 29], "wa": [4, 7, 8, 22, 26, 32], "wai": [3, 8, 19, 28, 29, 37], "wait": [7, 20, 29], "walk": 8, "want": [0, 3, 8, 10, 19, 20, 24], "warmup": [17, 20, 30], "warmup_data": [17, 30], "warn": [3, 18], "warp": 6, "wast": 28, "wc": 19, "we": [0, 1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 17, 19, 26, 28, 29, 31, 32, 36, 37], "webpag": 32, "weight": [1, 2, 4, 6, 8, 15, 16, 17, 19, 30, 32, 33, 35], "weight_decai": [23, 37], "weight_dtyp": 29, "weightonlyquantconfig": 29, "weightonlyquantizedlinear": 29, "weights_prepack": 2, "well": [1, 2, 3, 4, 20, 28, 29, 32], "were": 8, "wget": 20, "what": [11, 20, 22, 32, 33, 34], "wheel": [20, 26, 32, 34], "when": [2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 18, 19, 22, 26, 28, 29, 32, 33, 34, 35, 37], "where": [2, 3, 5, 6, 7, 9, 23], "whether": [2, 10, 11, 19, 20, 35], "which": [0, 1, 2, 4, 5, 6, 7, 8, 10, 11, 15, 18, 19, 22, 26, 28, 29, 32, 34, 36], "while": [10, 11, 17, 19, 20, 28, 29, 32], "whl": [6, 23, 26], "who": [10, 32, 34], "whole": [17, 18, 32], "wide": [7, 29], "wider": [1, 4], "widespread": [1, 28], "width": [0, 19, 28], "window": [26, 32], "wise": [10, 29, 30, 32, 37], "wish": [3, 8, 19, 20], "with_arg": [4, 17, 30], "within": [3, 18, 29, 30, 32, 33, 35], "without": [2, 5, 7, 11, 22, 26, 29, 32, 33], "wn": 19, "won": [0, 2, 11, 20, 22], "woq": 28, "woq_quantization_config": 29, "work": [0, 2, 3, 4, 6, 8, 9, 19, 24, 26, 28, 30, 32], "work_group": 8, "workabl": 2, "workaround": [26, 32], "worker": [5, 6, 9, 16], "workflow": [5, 17], "workgroup": 32, "workload": [1, 4, 5, 11, 17, 26, 28, 30, 32, 33], "workspac": [4, 14], "world": 6, "world_siz": [6, 9], "wors": 29, "worth": 14, "would": [0, 3, 4, 8, 13, 17, 18, 19, 26, 32], "wrap": [6, 9, 16, 35], "wrap_cpp_modul": 4, "wrapper": [6, 8, 22], "wrapper___local_scalar_dens": 22, "wrapper___reshape_alia": 22, "wrapper___unique2": 22, "wrapper__as_strid": 22, "wrapper__clon": 22, "wrapper__copy_": 22, "wrapper__empty_strid": 22, "wrapper__resize_": 22, "wrapper_memory_format_empti": 22, "write": [0, 5, 20], "written": [0, 4, 32, 35], "wrong": [26, 32], "wrongli": 26, "wsl2": [26, 32], "ww42": 31, "x": [0, 1, 4, 8, 9, 11, 19, 25, 29, 34], "x1": 13, "x2": 13, "xcr0": 0, "xdf": 3, "xe": [13, 28, 32], "xe_hpg": 10, "xe_lpg": 10, "xelink": [10, 32], "xelta": 28, "xeon": 31, "xetla": [10, 13, 32], "xmx": [1, 25, 32], "xpu": [1, 2, 5, 6, 7, 9, 10, 11, 13, 14, 16, 17, 18, 20, 22, 23, 24, 25, 26, 29, 30, 31, 35], "xpu_kwarg": 9, "xpu_r": 19, "xpucomputeeng": 13, "xpumalloc": 2, "xpustream": 4, "xsave": 0, "xxx": [18, 32], "y": [4, 11, 29], "ye": 3, "yield": 1, "you": [0, 1, 2, 3, 4, 5, 6, 8, 9, 11, 16, 18, 19, 20, 22, 23, 24, 26, 28, 29, 30, 32, 33, 34, 36], "youngjoo": 29, "your": [1, 3, 4, 6, 8, 9, 11, 16, 20, 22, 23, 24, 26, 27, 29, 33, 34, 36], "your_generation_param": 32, "yourself": 8, "z": [4, 8], "ze_flat_device_hierarchi": 5, "zero": [2, 5, 9, 26, 29, 34], "zero_grad": [4, 9, 16, 23], "zero_point": 17, "zeros_lik": 8, "zhang": 29, "zoo": 4}, "titles": ["Intel\u00ae Extension for PyTorch* CPU ISA Dynamic Dispatch Design Doc", "Intel\u00ae Extension for 
PyTorch*", "API Documentation", "Contribution", "Examples", "Features", "DistributedDataParallel (DDP)", "DLPack Solution", "DPC++ Extension", "Fully Sharded Data Parallel (FSDP)", "Advanced Configuration", "Auto Mixed Precision (AMP) on GPU", "Auto Channels Last", "Compute Engine (Experimental feature for debug)", "Intel\u00ae Extension for PyTorch* - DeepSpeed* Kernels", "Float8 Data Type Support (Prototype)", "Horovod with PyTorch (Prototype)", "Intel\u00ae Extension for PyTorch* Optimizations for Quantization [GPU]", "IPEX_LOG (Prototype)", "Channels Last", "Kineto Supported Profiler Tool (Prototype)", "Legacy Profiler Tool (Deprecated)", "Simple Trace Tool (Deprecated)", "torch.compile for GPU (Beta)", "Quick Start", "Introduction", "Troubleshooting", "License", "Large Language Models (LLM) Optimizations Overview", "Weight-Only Quantization (Prototype)", "Transformers Optimization Frontend API", "Performance", "Releases", "Technical Details", "Ahead of Time (AOT) Compilation", "ipex.optimize Frontend API", "Memory Management", "Optimizer Fusion on GPU"], "titleterms": {"0": 32, "1": [31, 32], "10": [31, 32], "110": 32, "120": 32, "13": 32, "1550": 28, "1d": 19, "2": 32, "20": 32, "200": 32, "3": 32, "30": 32, "40": 32, "For": 29, "Into": 20, "That": 19, "accessor": 8, "add": [0, 20], "advanc": [5, 10], "ahead": [33, 34], "ai": [4, 31], "all": 19, "amp": [5, 11], "aot": [33, 34], "api": [0, 2, 5, 6, 12, 19, 25, 30, 35], "applic": 20, "arc": 28, "architectur": 1, "asynchron": 7, "aten": [0, 19], "auto": [5, 11, 12], "autocast": 11, "automat": 35, "b": 19, "basic": 4, "behavior": 11, "benchmark": 29, "bert": 4, "beta": [5, 23], "better": 3, "bfloat16": [4, 11], "bind": 6, "block": 19, "build": [0, 3, 8, 10, 20], "c": [2, 4, 18, 19], "c10": 8, "cach": 28, "can": 11, "capsul": 7, "case": [7, 11, 13, 20, 22, 34], "center": [28, 31], "channel": [5, 12, 19, 35], "check": 0, "chrome": 20, "cmake": 8, "code": [0, 4], "codegen": 0, "common": 30, "compil": [0, 5, 8, 23, 33, 34], "compon": 18, "compressor": 29, "comput": [5, 13], "configur": [5, 10, 31], "contribut": 3, "conv_bn_fold": 35, "convers": 19, "convolut": 19, "core": 28, "correspond": 8, "coverag": 19, "cpp": 0, "cpu": [0, 19], "creat": 19, "creation": 19, "csrc": 0, "current": 8, "custom": [0, 4], "d": 19, "data": [5, 7, 9, 15, 28, 31], "ddp": 6, "debug": [0, 3, 5, 13], "deep": 28, "deepspe": [14, 30], "default": [11, 12, 19], "definit": 18, "depend": [23, 26], "deprec": [21, 22], "design": [0, 7], "detail": 33, "determin": 19, "develop": 3, "devic": 20, "disabl": [12, 20, 22], "dispatch": [0, 29], "dispatchstub": 0, "distribut": [5, 28, 30], "distributeddataparallel": 6, "dldevic": 7, "dlpack": [5, 7], "doc": 0, "document": [2, 3, 25], "dpc": [4, 5, 8], "dynam": [0, 6], "dyndisp": 0, "eas": 12, "easi": 5, "elig": 11, "enabl": [12, 22], "engin": [5, 13], "enviorn": 18, "environ": 29, "event": 18, "exampl": [0, 4, 6, 7, 8, 9, 15, 23], "execut": [24, 29], "experiment": 13, "export": [7, 20], "extens": [0, 1, 3, 5, 6, 8, 14, 17, 29], "featur": [0, 5, 13, 29], "fetch": 8, "fine": 28, "float16": [4, 11], "float32": [4, 11], "float8": 15, "folder": 0, "format": 19, "fp16": 30, "fp8": 15, "framework": 29, "from": 6, "frontend": [30, 35], "fsdp": [5, 9], "fulli": [5, 9], "fuse_update_step": 35, "fusion": [28, 37], "gener": [2, 26], "get": 25, "gpu": [5, 6, 9, 11, 17, 23, 28, 29, 32, 33, 37], "graphic": 28, "h": 0, "hardwar": 31, "highlight": 32, "horovod": 16, "i": 19, "imper": [4, 11, 17, 30], "implement": [0, 13], "import": 7, "infer": [4, 
11, 28, 29, 30], "inferenec": 23, "initi": 29, "input": 11, "instal": [6, 16, 29], "instanc": 4, "int4": 28, "int8": 4, "intel": [0, 1, 3, 4, 6, 14, 17, 28, 29, 31], "intrin": 0, "introduct": [6, 7, 8, 9, 11, 13, 14, 18, 20, 21, 22, 23, 25, 29, 34, 37], "ipex": [33, 35], "ipex_log": [5, 18], "ipex_simple_trac": 18, "ipex_verbos": 18, "isa": 0, "issu": [12, 26, 32], "jit": 8, "kernel": [0, 4, 14, 19], "kineto": [5, 20], "known": [12, 32], "kv": 28, "languag": 28, "larg": 28, "last": [5, 12, 19, 35], "launch": 6, "layout": 19, "legaci": 21, "level": [0, 18], "librari": 26, "licens": 27, "linear": [28, 29], "linear_bn_fold": 35, "link": 6, "list": 28, "llm": [28, 29, 31], "load": 29, "local": 3, "log": 18, "low": 28, "manag": [2, 33, 36], "manner": 19, "manual": 0, "matrix": 29, "matter": 19, "max": 28, "memori": [2, 19, 33, 36], "methodologi": 28, "mix": [5, 11], "mode": [4, 15, 17, 30], "model": [4, 19, 20, 22, 28, 29], "motiv": 8, "mpi": 6, "multi": 20, "multipl": 13, "nativ": 19, "nchw": 19, "nchw16c": 19, "neural": 29, "nhwc": 19, "node": 6, "oneccl": 6, "onednn": 19, "onli": [6, 9, 28, 29], "op": [8, 11], "oper": [13, 15, 19, 28, 37], "optim": [4, 17, 28, 30, 33, 35, 37], "option": 29, "overview": [0, 28, 31], "parallel": [5, 9], "path": 11, "perform": [26, 31], "platform": 14, "pointer": 7, "polici": [13, 28], "prebuilt": 6, "precis": [5, 11, 28], "primit": 19, "privat": 0, "process": 0, "processor": 28, "product": 31, "profil": [5, 20, 21], "program": 7, "promot": 11, "prototyp": [5, 15, 16, 18, 20, 29], "pseudocod": 30, "pytest": 3, "python": [4, 5, 18], "pytorch": [0, 1, 3, 6, 14, 16, 17, 19, 29], "quantiz": [2, 5, 15, 17, 28, 29], "queue": 8, "quick": 24, "recommend": 6, "refer": [4, 11, 29], "regist": 19, "releas": 32, "replac": 18, "replace_dropout_with_ident": 35, "request": 8, "requir": [0, 23, 34], "resnet50": 4, "result": [20, 22], "run": [15, 29], "runtim": [6, 10, 29], "save": 29, "scale": 6, "scenario": 30, "script": [20, 29], "segment": 28, "select": [0, 13], "set": 18, "setup": 29, "setuptool": 8, "shard": [5, 9], "simpl": [18, 22], "singl": [4, 6], "smoothquant": 30, "softwar": 31, "solut": [5, 7], "sourc": 6, "specif": [0, 11], "split_master_weight_for_bf16": 35, "start": [24, 25], "stride": 19, "struct": 0, "stub": 0, "support": [1, 5, 11, 14, 15, 19, 20, 29], "sycl": [4, 8], "technic": 33, "tensor": 19, "test": [3, 26], "time": [10, 33, 34], "tip": 3, "tool": [5, 20, 21, 22], "torch": [4, 5, 23], "torchscript": [4, 11, 17, 30], "trace": [20, 22], "train": [4, 5, 11, 23], "transform": [29, 30], "troubleshoot": 26, "tune": 28, "type": [11, 15, 28], "ultra": 28, "unit": [3, 26], "us": [4, 5, 7, 8, 11, 12, 13, 20, 22, 34], "usag": [4, 6, 9, 15, 16, 18, 23, 26, 29, 30], "v2": 31, "valid": 28, "vec": 0, "version": 31, "weight": [28, 29], "what": 19, "wheel": 6, "widest": 11, "woq": 29, "write": [3, 8, 19], "xpu": [3, 4, 8, 19, 32], "xpustream": 8, "xyz": 0, "xyzkrnl": 0}}) \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/api_doc.html b/xpu/2.3.110+xpu/tutorials/api_doc.html new file mode 100644 index 000000000..190a71aa8 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/api_doc.html @@ -0,0 +1,792 @@ + + + + + + + API Documentation — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

API Documentation

+
+

General

+
+
+ipex.optimize(model, dtype=None, optimizer=None, level='O1', inplace=False, conv_bn_folding=None, linear_bn_folding=None, weights_prepack=None, replace_dropout_with_identity=None, optimize_lstm=None, split_master_weight_for_bf16=None, fuse_update_step=None, auto_kernel_selection=None, sample_input=None, graph_mode=None, concat_linear=None)
+

Apply optimizations at the Python frontend to the given model (nn.Module), as +well as the given optimizer (optional). If the optimizer is given, +optimizations will be applied for training. Otherwise, optimizations will be +applied for inference. Optimizations include conv+bn folding (for +inference only), weight prepacking, and so on.

+

Weight prepacking is a technique to accelerate the performance of oneDNN +operators. In order to achieve better vectorization and cache reuse, oneDNN +uses a specific memory layout called blocked layout. Although the +calculation itself with blocked layout is fast enough, from a memory usage +perspective it has drawbacks. Running with the blocked layout, oneDNN +splits one or several dimensions of data into blocks of fixed size each +time the operator is executed. More detailed information about the oneDNN data +memory format is available in the oneDNN manual. +To reduce this overhead, data is converted to predefined block shapes +prior to oneDNN operator execution. At runtime, if the data +shape matches the oneDNN operator execution requirements, oneDNN won’t perform +memory layout conversion but goes directly to calculation. Through this +methodology, called weight prepacking, it is possible to avoid runtime +weight data format conversion and thus increase performance.

+
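For illustration, a minimal sketch of how a representative input interacts with prepacking, based on the sample_input parameter described below (the model and the input shape here are placeholders and assume a CPU inference case):
>>> import torch
>>> import intel_extension_for_pytorch as ipex
>>> model = ...                          # any inference nn.Module (placeholder)
>>> model.eval()
>>> # a sample input with the real shape lets the extension choose the packed-weight block format (CPU only)
>>> sample = torch.rand(1, 3, 224, 224)  # hypothetical input shape
>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16, sample_input=sample)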
+
Parameters:
+
    +
  • model (torch.nn.Module) – User model to apply optimizations on.

  • +
  • dtype (torch.dtype) – Only works for torch.bfloat16 and torch.half, a.k.a. torch.float16. +Model parameters will be cast to torch.bfloat16 or torch.half +according to the dtype setting. The default value is None, meaning do nothing. +Note: Data type conversion is only applied to nn.Conv2d, nn.Linear +and nn.ConvTranspose2d for both training and inference cases. For +inference mode, additional data type conversion is applied to the weights +of nn.Embedding and nn.LSTM.

  • +
  • optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations +on, such as SGD. The default value is None, meaning inference case.

  • +
  • level (string) – "O0" or "O1". No optimizations are applied with +"O0". The optimize function just returns the original model and +optimizer. With "O1", the following optimizations are applied: +conv+bn folding, weights prepack, dropout removal (inference model), +master weight split and fused optimizer update step (training model). +The optimization options can be further overridden by setting the +following options explicitly. The default value is "O1".

  • +
  • inplace (bool) – Whether to perform inplace optimization. Default value is +False.

  • +
  • conv_bn_folding (bool) – Whether to perform conv_bn folding. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
  • linear_bn_folding (bool) – Whether to perform linear_bn folding. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
  • weights_prepack (bool) – Whether to perform weight prepack for convolution +and linear to avoid oneDNN weights reorder. The default value is +None. Explicitly setting this knob overwrites the configuration +set by level knob. Weight prepack works for CPU only.

  • +
  • replace_dropout_with_identity (bool) – Whether to replace nn.Dropout +with nn.Identity. If replaced, the aten::dropout won’t be +included in the JIT graph. This may provide more fusion opportunities +on the graph. This only works for inference models. The default value +is None. Explicitly setting this knob overwrites the configuration +set by level knob.

  • +
  • optimize_lstm (bool) – Whether to replace nn.LSTM with IPEX LSTM +which takes advantage of oneDNN kernels to get better performance. +The default value is None. Explicitly setting this knob +overwrites the configuration set by level knob.

  • +
  • split_master_weight_for_bf16 (bool) – Whether to split master weights +update for BF16 training. This saves memory compared to the master +weight update solution. The split master weights update methodology +doesn’t support all optimizers. The default value is None. +Explicitly setting this knob overwrites +the configuration set by level knob.

  • +
  • fuse_update_step (bool) – Whether to use fused params update for training, +which has better performance. It doesn’t support all optimizers. +The default value is None. Explicitly setting this knob +overwrites the configuration set by level knob.

  • +
  • sample_input (tuple or torch.Tensor) – Sample input data to feed to ipex.optimize. The shape of +the input data will impact the block format of the packed weight. If a sample +input is not fed, Intel® Extension for PyTorch* will pack the weight per some predefined heuristics. +If a sample input with the real input shape is fed, Intel® Extension for PyTorch* can choose the +best block format. Sample input works for CPU only.

  • +
  • auto_kernel_selection (bool) – Different backends may have +different performance with different dtypes/shapes. Intel® Extension for PyTorch* will try to optimize the +kernel selection for better performance if this knob is set to +True. You might get better performance at the cost of extra memory usage. +The default value is None. Explicitly setting this knob overwrites the +configuration set by level knob. Auto kernel selection works for CPU only.

  • +
  • graph_mode – (bool) [prototype]: Whether to automatically apply a combination of methods +to generate a graph or multiple subgraphs if True. The default value is False.

  • +
  • concat_linear (bool) – Whether to perform concat_linear. It only +works for inference model. The default value is None. Explicitly +setting this knob overwrites the configuration set by level knob.

  • +
+
+
Returns:
+

Model and optimizer (if given) modified according to the level knob +or other user settings. conv+bn folding may take place and +dropout may be replaced by identity. In inference scenarios, +convolution, linear and lstm will be replaced with the optimized +counterparts in Intel® Extension for PyTorch* (weight prepack for +convolution and linear) for good performance. In bfloat16 or float16 scenarios, +parameters of convolution and linear will be cast to bfloat16 or float16 dtype.

+
+
+
+

Warning

+

Please invoke the optimize function BEFORE invoking DDP in distributed +training scenarios.

+

The optimize function deep-copies the original model. If DDP is invoked +before the optimize function, DDP is applied to the original model, rather +than the one returned from the optimize function. In this case, some +operators in DDP, like allreduce, will not be invoked and thus may cause +unpredictable accuracy loss.

+
+

Examples

+
>>> # bfloat16 inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16)
+>>> # running evaluation step.
+>>> # bfloat16 training case.
+>>> optimizer = ...
+>>> model.train()
+>>> optimized_model, optimized_optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+>>> # running training step.
+
+
+
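The individual knobs documented above can also be set explicitly to override what the level recipe would otherwise choose; a minimal sketch (the model is a placeholder):
>>> # keep the O1 defaults but skip weight prepacking and conv+bn folding
>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16, weights_prepack=False, conv_bn_folding=False)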

torch.xpu.optimize() is an alternative to the optimize API in Intel® Extension for PyTorch*, +provided for identical usage on the XPU device only. The motivation for adding this alias is +to unify the coding style in user scripts based on the torch.xpu module.

+

Examples

+
>>> # bfloat16 inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = torch.xpu.optimize(model, dtype=torch.bfloat16)
+>>> # running evaluation step.
+>>> # bfloat16 training case.
+>>> optimizer = ...
+>>> model.train()
+>>> optimized_model, optimized_optimizer = torch.xpu.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+>>> # running training step.
+
+
+
+ +
+
+ipex.llm.optimize(model, optimizer=None, dtype=torch.float32, inplace=False, device='cpu', quantization_config=None, qconfig_summary_file=None, low_precision_checkpoint=None, sample_inputs=None, deployment_mode=True)
+

Apply optimizations at the Python frontend to the given transformers model (nn.Module). +This API focuses on transformers models, especially for generation task inference. +Well-supported model families: Llama, GPT-J, GPT-Neox, OPT, Falcon.

+
+
Parameters:
+
    +
  • model (torch.nn.Module) – User model to apply optimizations on.

  • +
  • optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations +on, such as SGD. The default value is None, meaning inference case.

  • +
  • dtype (torch.dtype) – Now it works for torch.bfloat16 and torch.float. +The default value is torch.float. When working with quantization, it specifies the mixed dtype used with quantization.

  • +
  • inplace (bool) – Whether to perform inplace optimization. Default value is False.

  • +
  • device (str) – Specifies the device on which the optimization will be performed, either ‘cpu’ or ‘xpu’.

  • +
  • quantization_config (object) – Defines the IPEX quantization recipe (weight-only quant or static quant). +Default value is None. +Once set, the IPEX quantized model is used for model.generate(). (Only works on CPU.)

  • +
  • qconfig_summary_file (str) – Path to the IPEX static quantization config JSON file. +Default value is None. Works with quantization_config under the static quantization use case; +IPEX static quantization calibration needs to be done to generate this file. (Only works on CPU.)

  • +
  • low_precision_checkpoint (dict or tuple of dict) – For weight only quantization with INT4 weights. +If it’s a dict, it should be the state_dict of the checkpoint (.pt) generated by GPTQ, etc. +If a tuple is provided, it should be (checkpoint, checkpoint config), +where checkpoint is the state_dict and checkpoint config is a dict specifying +the keys of groups in the state_dict. +The default config is { groups: ‘-1’ }. Change the values of the dict to make a custom config. +Weight shapes should be N by K; they are quantized to UINT4 and compressed along K, then stored as +torch.int32. Zero points are also UINT4 and stored as INT32. Scales and bias are floating point values. +Bias is optional. If bias is not in the state dict, the bias of the original model is used. +Only per-channel quantization of the weight is supported (group size = -1). +Default value is None.

  • +
  • sample_inputs (tuple of tensors) – Sample inputs used for model quantization or TorchScript. +Default value is None; for well-supported models, these sample inputs are provided automatically. (Only works on CPU.)

  • +
  • deployment_mode (bool) – Whether to apply the optimized model for deployment of model generation. +If True, there is no need to further apply optimizations like TorchScript. Default value is True. (Only works on CPU.)

  • +
+
+
Returns:
+

Optimized model object for model.generate(); it is also workable with model.forward.

+
+
+
+

Warning

+

Please invoke the ipex.llm.optimize function AFTER invoking DeepSpeed in the Tensor Parallel +inference scenario.

+
+

Examples

+
>>> # bfloat16 generation inference case.
+>>> model = ...
+>>> model.load_state_dict(torch.load(PATH))
+>>> model.eval()
+>>> optimized_model = ipex.llm.optimize(model, dtype=torch.bfloat16)
+>>> optimized_model.generate()
+
+
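Based on the device argument described above, the same call can also target a GPU; a hedged sketch assuming the model has already been moved to the xpu device:
>>> # bfloat16 generation inference on an Intel GPU
>>> model = model.to("xpu")
>>> model.eval()
>>> optimized_model = ipex.llm.optimize(model, dtype=torch.bfloat16, device="xpu")
>>> optimized_model.generate()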
+
+ +
+
+ipex.get_fp32_math_mode(device='cpu')
+

Get the current fpmath_mode setting.

+
+
Parameters:
+

device (string) – cpu, xpu

+
+
Returns:
+

Fpmath mode +The value will be FP32MathMode.FP32, FP32MathMode.BF32 or FP32MathMode.TF32 (GPU ONLY). +oneDNN fpmath mode will be disabled by default if dtype is set to FP32MathMode.FP32. +The implicit FP32 to TF32 data type conversion will be enabled if dtype is set +to FP32MathMode.TF32. The implicit FP32 to BF16 data type conversion will be +enabled if dtype is set to FP32MathMode.BF32.

+
+
+

Examples

+
>>> import intel_extension_for_pytorch as ipex
+>>> # to get the current fpmath mode
+>>> ipex.get_fp32_math_mode(device="xpu")
+
+
+

torch.xpu.get_fp32_math_mode() is an alternative function in Intel® Extension for PyTorch*, +provided for identical usage on the XPU device only. The motivation for adding this alias is +to unify the coding style in user scripts based on the torch.xpu module.

+

Examples

+
>>> import intel_extension_for_pytorch as ipex
+>>> # to get the current fpmath mode
+>>> torch.xpu.get_fp32_math_mode(device="xpu")
+
+
+
+ +
+
+ipex.set_fp32_math_mode(mode=FP32MathMode.FP32, device='cpu')
+

Enable or disable implicit data type conversion.

+
+
Parameters:
+
    +
  • mode (FP32MathMode) – FP32MathMode.FP32, FP32MathMode.BF32 or +FP32MathMode.TF32 (GPU ONLY). oneDNN fpmath mode will be disabled by default if dtype +is set to FP32MathMode.FP32. The implicit FP32 to TF32 data type conversion +will be enabled if dtype is set to FP32MathMode.TF32. The implicit FP32 +to BF16 data type conversion will be enabled if dtype is set to FP32MathMode.BF32.

  • +
  • device (string) – cpu, xpu

  • +
+
+
+

Examples

+
>>> import intel_extension_for_pytorch as ipex
+>>> # to enable the implicit data type conversion
+>>> ipex.set_fp32_math_mode(device="xpu", mode=ipex.FP32MathMode.BF32)
+>>> # to disable the implicit data type conversion
+>>> ipex.set_fp32_math_mode(device="xpu", mode=ipex.FP32MathMode.FP32)
+
+
+

torch.xpu.set_fp32_math_mode() is an alternative function in Intel® Extension for PyTorch*, +provided for identical usage on the XPU device only. The motivation for adding this alias is +to unify the coding style in user scripts based on the torch.xpu module.

+

Examples

+
>>> import intel_extension_for_pytorch as ipex
+>>> # to enable the implicit data type conversion
+>>> torch.xpu.set_fp32_math_mode(device="xpu", mode=ipex.FP32MathMode.BF32)
+>>> # to disable the implicit data type conversion
+>>> torch.xpu.set_fp32_math_mode(device="xpu", mode=ipex.FP32MathMode.FP32)
+
+
+
+ +
+
+

Memory management

+
+
+torch.xpu.empty_cache() None
+

Releases all unoccupied cached memory currently held by the caching +allocator so that it can be used by other GPU applications and becomes visible in the +sysman toolkit.

+
+

Note

+

empty_cache() doesn’t increase the amount of GPU +memory available for PyTorch. However, it may help reduce fragmentation +of GPU memory in certain cases. See Memory Management [GPU] for +more details about GPU memory management.

+
+
+ +
+
+torch.xpu.memory_stats(device: int | str | device | None = None) Dict[str, Any]
+

Returns a dictionary of XPU memory allocator statistics for a +given device.

+

The return value of this function is a dictionary of statistics, each of +which is a non-negative integer.

+

Core statistics:

+
    +
  • "allocated.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +number of allocation requests received by the memory allocator.

  • +
  • "allocated_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +amount of allocated memory.

  • +
  • "segment.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +number of reserved segments from xpuMalloc().

  • +
  • "reserved_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +amount of reserved memory.

  • +
  • "active.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +number of active memory blocks.

  • +
  • "active_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +amount of active memory.

  • +
  • "inactive_split.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +number of inactive, non-releasable memory blocks.

  • +
  • "inactive_split_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}": +amount of inactive, non-releasable memory.

  • +
+

For these core statistics, values are broken down as follows.

+

Pool type:

+
    +
  • all: combined statistics across all memory pools.

  • +
  • large_pool: statistics for the large allocation pool +(as of October 2019, for size >= 1MB allocations).

  • +
  • small_pool: statistics for the small allocation pool +(as of October 2019, for size < 1MB allocations).

  • +
+

Metric type:

+
    +
  • current: current value of this metric.

  • +
  • peak: maximum value of this metric.

  • +
  • allocated: historical total increase in this metric.

  • +
  • freed: historical total decrease in this metric.

  • +
+

In addition to the core statistics, we also provide some simple event +counters:

+
    +
  • "num_alloc_retries": number of failed xpuMalloc calls that +result in a cache flush and retry.

  • +
  • "num_ooms": number of out-of-memory errors thrown.

  • +
+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistics for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
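A minimal sketch of reading a few of the statistics named above (assuming an XPU device is available and the extension has been imported, as in the other examples on this page; keys follow the {stat}.{pool}.{metric} pattern):
>>> import torch
>>> import intel_extension_for_pytorch as ipex
>>> stats = torch.xpu.memory_stats()
>>> stats["allocated_bytes.all.current"]   # bytes currently allocated across all pools
>>> stats["num_alloc_retries"]             # failed xpuMalloc calls that caused a cache flush and retry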
+ +
+
+torch.xpu.memory_summary(device: int | str | device | None = None, abbreviated: bool = False) str
+

Returns a human-readable printout of the current memory allocator +statistics for a given device.

+

This can be useful to display periodically during training, or when +handling out-of-memory exceptions.

+
+
Parameters:
+
    +
  • device (torch.device or int, optional) – selected device. Returns +printout for the current device, given by current_device(), +if device is None (default).

  • +
  • abbreviated (bool, optional) – whether to return an abbreviated summary +(default: False).

  • +
+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
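For example, an abbreviated summary can be printed periodically during training (a minimal sketch, assuming the setup from the previous entries):
>>> print(torch.xpu.memory_summary(abbreviated=True))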
+
+ +
+
+torch.xpu.memory_snapshot()
+

Returns a snapshot of the XPU memory allocator state across all devices.

+

Interpreting the output of this function requires familiarity with the +memory allocator internals.

+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
+ +
+
+torch.xpu.memory_allocated(device: int | str | device | None = None) int
+

Returns the current GPU memory occupied by tensors in bytes for a given +device.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

This is likely less than the amount shown in the sysman toolkit since some +unused memory can be held by the caching allocator and some context +needs to be created on the GPU. See Memory Management [GPU] for more +details about GPU memory management.

+
+
+ +
+
+torch.xpu.max_memory_allocated(device: int | str | device | None = None) int
+

Returns the maximum GPU memory occupied by tensors in bytes for a given +device.

+

By default, this returns the peak allocated memory since the beginning of +this program. reset_peak_stats() can be used to +reset the starting point in tracking this metric. For example, these two +functions can measure the peak allocated memory usage of each iteration in a +training loop.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
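A hedged sketch of the per-iteration peak measurement described above (train_loader, the model, and the training step are placeholders):
for step, (data, target) in enumerate(train_loader):
    torch.xpu.reset_peak_memory_stats()
    ...  # forward/backward/optimizer step on the xpu device (placeholder)
    print("step", step, "peak allocated bytes:", torch.xpu.max_memory_allocated())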
+
+ +
+
+torch.xpu.memory_reserved(device: int | str | device | None = None) int
+

Returns the current GPU memory managed by the caching allocator in bytes +for a given device.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
+ +
+
+torch.xpu.max_memory_reserved(device: int | str | device | None = None) int
+

Returns the maximum GPU memory managed by the caching allocator in bytes +for a given device.

+

By default, this returns the peak cached memory since the beginning of this +program. reset_peak_stats() can be used to reset +the starting point in tracking this metric. For example, these two functions +can measure the peak cached memory amount of each iteration in a training +loop.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
+ +
+
+torch.xpu.reset_peak_memory_stats(device: int | str | device | None = None) None
+

Resets the “peak” stats tracked by the XPU memory allocator.

+

See memory_stats() for details. Peak stats correspond to the +“peak” key in each individual stat dict.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
+ +
+
+torch.xpu.memory_stats_as_nested_dict(device: int | str | device | None = None) Dict[str, Any]
+

Returns the result of memory_stats() as a nested dictionary.

+
+ +
+
+torch.xpu.reset_accumulated_memory_stats(device: int | str | device | None = None) None
+

Resets the “accumulated” (historical) stats tracked by the XPU memory allocator.

+

See memory_stats() for details. Accumulated stats correspond to +the “allocated” and “freed” keys in each individual stat dict, as well as +“num_alloc_retries” and “num_ooms”.

+
+
Parameters:
+

device (torch.device or int, optional) – selected device. Returns +statistic for the current device, given by current_device(), +if device is None (default).

+
+
+
+

Note

+

See Memory Management [GPU] for more details about GPU memory +management.

+
+
+ +
+
+

Quantization

+
+
+ipex.quantization.fp8.fp8_autocast(enabled: bool = False, calibrating: bool = False, fp8_recipe: DelayedScaling | None = None, fp8_group: ProcessGroup | None = None) None
+

Context manager for FP8 usage.

+
with fp8_autocast(enabled=True):
+    out = model(inp)
+
+
+
+
Parameters:
+
    +
  • enabled (bool, default = True) – whether or not to enable fp8

  • +
  • calibrating (bool, default = False) – calibration mode allows collecting statistics such as amax and scale +data of fp8 tensors even when executing without fp8 enabled.

  • +
  • fp8_recipe (recipe.DelayedScaling, default = None) – recipe used for FP8 training.

  • +
  • fp8_group (torch._C._distributed_c10d.ProcessGroup, default = None) – distributed group over which amaxes for the fp8 tensors +are reduced at the end of each training step.

  • +
+
+
+
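Based on the calibrating flag described above, a minimal calibration-mode sketch (model and inp as in the usage snippet; statistics such as amax and scale are collected while the model still executes without fp8):
with fp8_autocast(enabled=False, calibrating=True):
    out = model(inp)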
+ +
+
+

C++ API

+
+
+enum torch_ipex::xpu::FP32_MATH_MODE
+

Specifies the available FP32 math modes.

+

Values:

+
+
+enumerator FP32
+

set floating-point math mode to FP32.

+
+ +
+
+enumerator TF32
+

set floating-point math mode to TF32.

+
+ +
+
+enumerator BF32
+

set floating-point math mode to BF32.

+
+ +
+
+enumerator FP32_MATH_MODE_MIN
+
+ +
+
+enumerator FP32_MATH_MODE_MAX
+

set floating-point math mode.

+
+ +
+ +
+
+bool torch_ipex::xpu::set_fp32_math_mode(FP32_MATH_MODE mode)
+

Enable or disable implicit floating-point type conversion during computation for oneDNN kernels. Setting FP32MathMode.FP32 will disable floating-point type conversion. Setting FP32MathMode.TF32 will enable implicit down-conversion from fp32 to tf32. Setting FP32MathMode.BF32 will enable implicit down-conversion from fp32 to bf16.

+

Refer to Primitive Attributes: floating-point math mode for a detailed description of the definition and numerical behavior of floating-point math modes.

+
+
Parameters:
+

mode – (FP32MathMode): Only works for FP32MathMode.FP32, FP32MathMode.TF32 and FP32MathMode.BF32. oneDNN fpmath mode will be disabled by default if dtype is set to FP32MathMode.FP32. The implicit FP32 to TF32 data type conversion will be enabled if dtype is set to FP32MathMode.TF32. The implicit FP32 to BF16 data type conversion will be enabled if dtype is set to FP32MathMode.BF32.

+
+
+
+ +
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/contribution.html b/xpu/2.3.110+xpu/tutorials/contribution.html new file mode 100644 index 000000000..0b77abc36 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/contribution.html @@ -0,0 +1,289 @@ + + + + + + + Contribution — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Contribution

+
+

Contributing to Intel® Extension for PyTorch*

+

Thank you for your interest in contributing to Intel® Extension for PyTorch*. Before you begin writing code, it is important that you share your intention to contribute with the team, based on the type of contribution:

+
    +
  1. You want to propose a new feature and implement it.

    +
      +
    • Post about your intended feature in a GitHub issue, and we shall discuss the design and implementation. Once we agree that the plan looks good, go ahead and implement it.

    • +
    +
  2. +
  3. You want to implement a feature or bug-fix for an outstanding issue.

    +
      +
    • Search for your issue in the GitHub issue list.

    • +
    • Pick an issue and comment that you’d like to work on the feature or bug-fix.

    • +
    • If you need more context on a particular issue, ask and we shall provide.

    • +
    +
  4. +
+

Once you implement and test your feature or bug-fix, submit a Pull Request to https://github.com/intel/intel-extension-for-pytorch.

+
+
+

Developing Intel® Extension for PyTorch* on XPU

+

A full set of instructions on installing Intel® Extension for PyTorch* from source is in the Installation document.

+

To develop on your machine, here are some tips:

+
    +
  1. Uninstall all existing Intel® Extension for PyTorch* installs. You may need to run pip uninstall intel_extension_for_pytorch multiple times. You’ll know intel_extension_for_pytorch is fully uninstalled when you see WARNING: Skipping intel_extension_for_pytorch as it is not installed. (You should only have to pip uninstall a few times, but you can always uninstall with timeout or in a loop.)

    +
    yes | pip uninstall intel_extension_for_pytorch
    +
    +
    +
  2. +
  3. Clone a copy of Intel® Extension for PyTorch* from source:

    +
    git clone https://github.com/intel/intel-extension-for-pytorch.git -b xpu-main
    +cd intel-extension-for-pytorch
    +
    +
    +

    If you already have Intel® Extension for PyTorch* from source, update it:

    +
    git pull --rebase
    +git submodule sync --recursive
    +git submodule update --init --recursive --jobs 0
    +
    +
    +
  4. +
  5. Install Intel® Extension for PyTorch* in develop mode:

    +

    Replace:

    +
    python setup.py install
    +
    +
    +

    with:

    +
    python setup.py develop
    +
    +
    +

    This mode will symlink the Python files from the current local source tree into the Python install. After that, if you modify a Python file, you do not need to reinstall Intel® Extension for PyTorch* again. This is especially useful if you are only changing Python files.

    +

    For example:

    +
      +
    • Install local Intel® Extension for PyTorch* in develop mode

    • +
    • modify your Python file intel_extension_for_pytorch/__init__.py (for example)

    • +
    • test functionality

    • +
    +
  6. +
+

You do not need to repeatedly install after modifying Python files (.py). However, you would need to reinstall if you modify a Python interface (.pyi, .pyi.in) or non-Python files (.cpp, .h, etc.).

+

If you want to reinstall, make sure that you uninstall Intel® Extension for PyTorch* first by running pip uninstall intel_extension_for_pytorch until you see WARNING: Skipping intel_extension_for_pytorch as it is not installed. Then run python setup.py clean. After that, you can install in develop mode again.

+
+

Tips and Debugging

+
    +
  • Our setup.py requires Python >= 3.6

  • +
  • If you run into errors when running python setup.py develop, here are some debugging steps:

    +
      +
    1. Remove your build directory. The setup.py script compiles binaries into the build folder and caches many details along the way. This saves time the next time you build. If you’re running into issues, you can always rm -rf build from the toplevel directory and start over.

    2. +
    3. If you have made edits to the Intel® Extension for PyTorch* repo, commit any change you’d like to keep and clean the repo with the following commands (note that clean really removes all untracked files and changes.):

      +
      git submodule deinit -f .
      +git clean -xdf
      +python setup.py clean
      +git submodule update --init --recursive --jobs 0 # very important to sync the submodules
      +python setup.py develop                          # then try running the command again
      +
      +
      +
    4. +
    5. The main step within python setup.py develop is running make from the build directory. If you want to experiment with some environment variables, you can pass them into the command:

      +
      ENV_KEY1=ENV_VAL1[, ENV_KEY2=ENV_VAL2]* python setup.py develop
      +
      +
      +
    6. +
    +
  • +
+
+
+
+

Unit testing

+

All Python test suites are located in the tests/gpu folder and start with test_. Run individual test suites using the command python tests/gpu/${Sub_Folder}/FILENAME.py, where FILENAME represents the file containing the test suite you wish to run and ${Sub_Folder} is one of the following folders:

+
    +
  • examples: unit tests created during op development

  • +
  • experimental: ported test suites from Stock PyTorch 1.10

  • +
  • regression: unit tests created during bug fix to avoid future regression

  • +
+
+

Better local unit tests with pytest

+

We don’t officially support pytest, but it works well with our unit tests and offers a number of useful features for local development. Install it via pip install pytest.

+

For more information about unit tests, please read README.md in the tests/gpu folder.

+
+
+
+

Writing documentation

+

Do you want to write some documentation for your code contribution and don’t know where to start?

+

Intel® Extension for PyTorch* uses Google style for formatting docstrings. The length of lines inside docstring blocks must be limited to 80 characters so they fit into Jupyter documentation popups.

+
+

Building documentation

+

To build the documentation:

+
    +
  1. Build and install Intel® Extension for PyTorch* (as discussed above)

  2. +
  3. Install the prerequisites:

    +
    cd docs
    +pip install -r requirements.txt
    +
    +
    +
  4. +
  5. Generate the documentation HTML files. The generated files will be in docs/_build/html.

    +
    make clean
    +make html
    +
    +
    +
  6. +
+
+

Tips

+

The .rst source files live in docs/tutorials folder. Some of the .rst files pull in docstrings from Intel® Extension for PyTorch* Python code (for example, via the autofunction or autoclass directives). To shorten doc build times, it is helpful to remove the files you are not working on, only keeping the base index.rst file and the files you are editing. The Sphinx build will produce missing file warnings but will still complete.

+
+
+
+
+ + +
+
+
+ +
+ +
+


© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/examples.html b/xpu/2.3.110+xpu/tutorials/examples.html new file mode 100644 index 000000000..c3a059052 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/examples.html @@ -0,0 +1,1133 @@ + + + + + + + Examples — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Examples

+

These examples will help you get started using Intel® Extension for PyTorch* +with Intel GPUs.

+

Prerequisites: +Before running these examples, install the torchvision and transformers Python packages.

+
    +
  • Python examples demonstrate usage of Python APIs:

    + +
  • +
  • C++ examples demonstrate usage of C++ APIs

  • +
  • Intel® AI Reference Models provide out-of-the-box use cases, demonstrating the performance benefits achievable with Intel Extension for PyTorch*

  • +
+
+

Python

+
+

Training

+
+

Single-Instance Training

+

To use Intel® Extension for PyTorch* on training, you need to make the following changes in your code:

+
    +
  1. Import intel_extension_for_pytorch as ipex.

  2. +
  3. Use the ipex.optimize function for additional performance boost, which applies optimizations against the model object, as well as an optimizer object.

  4. +
  5. Use Auto Mixed Precision (AMP) with BFloat16 data type.

  6. +
  7. Convert input tensors, loss criterion and model to XPU, as shown below:

  8. +
+
...
+import torch
+import intel_extension_for_pytorch as ipex
+...
+model = Model()
+criterion = ...
+optimizer = ...
+model.train()
+# Move model and loss criterion to xpu before calling ipex.optimize()
+model = model.to("xpu")
+criterion = criterion.to("xpu")
+
+# For Float32
+model, optimizer = ipex.optimize(model, optimizer=optimizer)
+# For BFloat16
+model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
+...
+dataloader = ...
+for (input, target) in dataloader:
+    input = input.to("xpu")
+    target = target.to("xpu")
+    optimizer.zero_grad()
+    # For Float32
+    output = model(input)
+
+    # For BFloat16
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+        output = model(input)
+
+    loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+...
+
+
+

Below you can find complete code examples demonstrating how to use the extension on training for different data types:

+
+
Float32
+
import torch
+import torchvision
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+LR = 0.001
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)
+
+model = torchvision.models.resnet50()
+criterion = torch.nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
+model.train()
+######################## code changes #######################
+model = model.to("xpu")
+criterion = criterion.to("xpu")
+model, optimizer = ipex.optimize(model, optimizer=optimizer)
+######################## code changes #######################
+
+for batch_idx, (data, target) in enumerate(train_loader):
+    ########## code changes ##########
+    data = data.to("xpu")
+    target = target.to("xpu")
+    ########## code changes ##########
+    optimizer.zero_grad()
+    output = model(data)
+    loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+    print(batch_idx)
+torch.save(
+    {
+        "model_state_dict": model.state_dict(),
+        "optimizer_state_dict": optimizer.state_dict(),
+    },
+    "checkpoint.pth",
+)
+
+print("Execution finished")
+
+
+
+
+
BFloat16
+
import torch
+import torchvision
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+LR = 0.001
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)
+
+model = torchvision.models.resnet50()
+criterion = torch.nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
+model.train()
+##################################### code changes ################################
+model = model.to("xpu")
+criterion = criterion.to("xpu")
+model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
+##################################### code changes ################################
+
+for batch_idx, (data, target) in enumerate(train_loader):
+    optimizer.zero_grad()
+    ######################### code changes #########################
+    data = data.to("xpu")
+    target = target.to("xpu")
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+    ######################### code changes #########################
+        output = model(data)
+        loss = criterion(output, target)
+    loss.backward()
+    optimizer.step()
+    print(batch_idx)
+torch.save(
+    {
+        "model_state_dict": model.state_dict(),
+        "optimizer_state_dict": optimizer.state_dict(),
+    },
+    "checkpoint.pth",
+)
+
+print("Execution finished")
+
+
+
+
+
+
+

Inference

+

Get additional performance boosts for your computer vision and NLP workloads by +applying the Intel® Extension for PyTorch* optimize function against your +model object.

+
+

Float32

+
+
Imperative Mode
+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+######## code changes #######
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model)
+######## code changes #######
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+######## code changes #######
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model)
+######## code changes #######
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchScript Mode
+

We recommend using Intel® Extension for PyTorch* with TorchScript for further optimizations.

+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+######## code changes #######
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model)
+######## code changes #######
+
+with torch.no_grad():
+    d = torch.rand(1, 3, 224, 224)
+    ##### code changes #####
+    d = d.to("xpu")
+    ##### code changes #####
+    model = torch.jit.trace(model, d)
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+######## code changes #######
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model)
+######## code changes #######
+
+with torch.no_grad():
+    d = torch.randint(vocab_size, size=[batch_size, seq_length])
+    ##### code changes #####
+    d = d.to("xpu")
+    ##### code changes #####
+    model = torch.jit.trace(model, (d,), strict=False)
+    model = torch.jit.freeze(model)
+
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
+

BFloat16

+

The optimize function works for both Float32 and BFloat16 data type. For BFloat16 data type, set the dtype parameter to torch.bfloat16. +We recommend using Auto Mixed Precision (AMP) with BFloat16 data type.

+
+
Imperative Mode
+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+#################### code changes #################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.bfloat16)
+#################### code changes #################
+
+with torch.no_grad():
+    ############################# code changes #####################
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+    ############################ code changes ######################
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes #################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.bfloat16)
+#################### code changes #################
+
+with torch.no_grad():
+    ########################### code changes ########################
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+    ########################### code changes ########################
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchScript Mode
+

We recommend using Intel® Extension for PyTorch* with TorchScript for further optimizations.

+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+#################### code changes #################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.bfloat16)
+#################### code changes #################
+
+with torch.no_grad():
+    d = torch.rand(1, 3, 224, 224)
+    ############################# code changes #####################
+    d = d.to("xpu")
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+    ############################# code changes #####################
+        model = torch.jit.trace(model, d)
+        model = torch.jit.freeze(model)
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes #################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.bfloat16)
+#################### code changes #################
+
+with torch.no_grad():
+    d = torch.randint(vocab_size, size=[batch_size, seq_length])
+    ############################# code changes #####################
+    d = d.to("xpu")
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
+    ############################# code changes #####################
+        model = torch.jit.trace(model, (d,), strict=False)
+        model = torch.jit.freeze(model)
+
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
+
+

Float16

+

The optimize function works for both Float32 and Float16 data type. For Float16 data type, set the dtype parameter to torch.float16. +We recommend using Auto Mixed Precision (AMP) with Float16 data type.

+
+
Imperative Mode
+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+#################### code changes ################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.float16)
+#################### code changes ################
+
+with torch.no_grad():
+    ############################# code changes #####################
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
+    ############################# code changes #####################
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.float16)
+#################### code changes ################
+
+with torch.no_grad():
+    ############################# code changes #####################
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
+    ############################# code changes #####################
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
+
TorchScript Mode
+

We recommend using Intel® Extension for PyTorch* with TorchScript for further optimizations.

+
+
Resnet50
+
import torch
+import torchvision.models as models
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+#################### code changes ################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.float16)
+#################### code changes ################
+
+with torch.no_grad():
+    d = torch.rand(1, 3, 224, 224)
+    ############################# code changes #####################
+    d = d.to("xpu")
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
+    ############################# code changes #####################
+        model = torch.jit.trace(model, d)
+        model = torch.jit.freeze(model)
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
BERT
+
import torch
+from transformers import BertModel
+
+############# code changes ###############
+import intel_extension_for_pytorch as ipex
+
+############# code changes ###############
+
+model = BertModel.from_pretrained("bert-base-uncased")
+model.eval()
+
+vocab_size = model.config.vocab_size
+batch_size = 1
+seq_length = 512
+data = torch.randint(vocab_size, size=[batch_size, seq_length])
+
+#################### code changes ################
+model = model.to("xpu")
+data = data.to("xpu")
+model = ipex.optimize(model, dtype=torch.float16)
+#################### code changes ################
+
+with torch.no_grad():
+    d = torch.randint(vocab_size, size=[batch_size, seq_length])
+    ############################# code changes #####################
+    d = d.to("xpu")
+    with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
+    ############################# code changes #####################
+        model = torch.jit.trace(model, (d,), strict=False)
+        model = torch.jit.freeze(model)
+
+        model(data)
+
+print("Execution finished")
+
+
+
+
+
+
+

INT8

+

We recommend using TorchScript mode for INT8 models because it has wider model support. TorchScript mode also auto-enables our optimizations. For a TorchScript INT8 model, inserting observers and quantizing the model are achieved through prepare_jit and convert_jit respectively. A calibration process is required to collect statistics from real data. After conversion, optimizations such as operator fusion are auto-enabled.

+
import os
+import torch
+from torch.jit._recursive import wrap_cpp_module
+from torch.quantization.quantize_jit import (
+    convert_jit,
+    prepare_jit,
+)
+
+#################### code changes ####################
+import intel_extension_for_pytorch as ipex
+
+######################################################
+
+##### Example Model #####
+import torchvision.models as models
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+model = model.to("xpu")
+
+with torch.no_grad():
+    data = torch.rand(1, 3, 224, 224)
+    data = data.to("xpu")
+    modelJit = torch.jit.trace(model, data)
+#########################
+
+qconfig = torch.quantization.QConfig(
+    activation=torch.quantization.observer.MinMaxObserver.with_args(
+        qscheme=torch.per_tensor_symmetric, reduce_range=False, dtype=torch.quint8
+    ),
+    weight=torch.quantization.default_weight_observer,
+)
+modelJit = prepare_jit(modelJit, {"": qconfig}, True)
+
+##### Example Dataloader #####
+import torchvision
+
+DOWNLOAD = True
+DATA = "datasets/cifar10/"
+
+transform = torchvision.transforms.Compose(
+    [
+        torchvision.transforms.Resize((224, 224)),
+        torchvision.transforms.ToTensor(),
+        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
+    ]
+)
+train_dataset = torchvision.datasets.CIFAR10(
+    root=DATA,
+    train=True,
+    transform=transform,
+    download=DOWNLOAD,
+)
+calibration_data_loader = torch.utils.data.DataLoader(
+    dataset=train_dataset, batch_size=128
+)
+
+for batch_idx, (d, target) in enumerate(calibration_data_loader):
+    print(f"calibrated on batch {batch_idx} out of {len(calibration_data_loader)}")
+    d = d.to("xpu")
+    modelJit(d)
+##############################
+
+modelJit = convert_jit(modelJit, True)
+
+data = torch.rand(1, 3, 224, 224)
+data = data.to("xpu")
+modelJit(data)
+
+print("Execution finished")
+
+
+
+
+

torch.xpu.optimize

+

The torch.xpu.optimize function is an alternative to ipex.optimize in Intel® Extension for PyTorch*, and provides identical usage for XPU devices only. The motivation for adding this alias is to unify the coding style in user scripts based on the torch.xpu module. Refer to the example below for usage.

+
import torch
+import torchvision.models as models
+
+############# code changes #########
+import intel_extension_for_pytorch
+
+############# code changes #########
+
+model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
+model.eval()
+data = torch.rand(1, 3, 224, 224)
+
+model = model.to(memory_format=torch.channels_last)
+data = data.to(memory_format=torch.channels_last)
+
+########## code changes #########
+model = model.to("xpu")
+data = data.to("xpu")
+model = torch.xpu.optimize(model)
+########## code changes #########
+
+with torch.no_grad():
+    model(data)
+
+print("Execution finished")
+
+
+
+
+
+
+

C++

+

To work with libtorch, the PyTorch C++ library, Intel® Extension for PyTorch* provides its own C++ dynamic library. The C++ library only handles inference workloads, such as service deployment. For regular development, use the Python interface. Unlike using libtorch, no specific code changes are required. Compilation follows the recommended methodology with CMake. Detailed instructions can be found in the PyTorch tutorial.

+

During compilation, Intel optimizations will be activated automatically after the C++ dynamic library of Intel® Extension for PyTorch* is linked.

+

The example code below works for all data types.

+
+

Basic Usage

+

Download and Install cppsdk

+

Ensure you have downloaded and installed cppsdk from the installation page before compiling the C++ code.

+
    +
  1. Go to installation page

  2. +
  3. Select the desired Platform & Version & OS

  4. +
  5. In the package part, select cppsdk

  6. +
  7. Follow the instructions in the cppsdk installation page to download and install cppsdk into libtorch.

  8. +
+

example-app.cpp

+
#include <torch/script.h>
+#include <iostream>
+#include <memory>
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  }
+  catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+  module.to(at::kXPU);
+
+  std::vector<torch::jit::IValue> inputs;
+  torch::Tensor input = torch::rand({1, 3, 224, 224}).to(at::kXPU);
+  inputs.push_back(input);
+
+  at::Tensor output = module.forward(inputs).toTensor();
+  std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+
+
+

CMakeLists.txt

+
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-app)
+
+find_package(IPEX REQUIRED)
+
+set(target example-app)
+add_executable(${target} example-app.cpp)
+target_link_libraries(${target} ${TORCH_IPEX_LIBRARIES})
+
+set_property(TARGET ${target} PROPERTY CXX_STANDARD 17)
+
+
+

Command for compilation

+
$ cd examples/gpu/inference/cpp/example-app
+$ mkdir build
+$ cd build
+$ CC=icx CXX=icpx cmake -DCMAKE_PREFIX_PATH=<LIBPYTORCH_PATH> ..
+$ make
+
+
+

The <LIBPYTORCH_PATH> is the absolute path of the libtorch we installed in the first step.

+

If Found IPEX is shown with dynamic library paths, the extension was linked into the binary. This can be verified with the Linux command ldd.

+

The value of x, y, z in the following log will change depending on the version you choose.

+
$ CC=icx CXX=icpx cmake -DCMAKE_PREFIX_PATH=/workspace/libtorch ..
+-- The C compiler identification is IntelLLVM 202x.y.z
+-- The CXX compiler identification is IntelLLVM 202x.y.z
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /workspace/intel/oneapi/compiler/202x.y.z/linux/bin/icx - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /workspace/intel/oneapi/compiler/202x.y.z/linux/bin/icpx - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+-- Looking for pthread.h
+-- Looking for pthread.h - found
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
+-- Found Threads: TRUE
+-- Found Torch: /workspace/libtorch/lib/libtorch.so
+-- Found IPEX: /workspace/libtorch/lib/libintel-ext-pt-cpu.so;/workspace/libtorch/lib/libintel-ext-pt-gpu.so
+-- Configuring done
+-- Generating done
+-- Build files have been written to: examples/gpu/inference/cpp/example-app/build
+
+$ ldd example-app
+        ...
+        libtorch.so => /workspace/libtorch/lib/libtorch.so (0x00007fd5bb927000)
+        libc10.so => /workspace/libtorch/lib/libc10.so (0x00007fd5bb895000)
+        libtorch_cpu.so => /workspace/libtorch/lib/libtorch_cpu.so (0x00007fd5a44d8000)
+        libintel-ext-pt-cpu.so => /workspace/libtorch/lib/libintel-ext-pt-cpu.so (0x00007fd5a1a1b000)
+        libintel-ext-pt-gpu.so => /workspace/libtorch/lib/libintel-ext-pt-gpu.so (0x00007fd5862b0000)
+        ...
+        libmkl_intel_lp64.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_intel_lp64.so.2 (0x00007fd584ab0000)
+        libmkl_core.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_core.so.2 (0x00007fd5806cc000)
+        libmkl_gnu_thread.so.2 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_gnu_thread.so.2 (0x00007fd57eb1d000)
+        libmkl_sycl.so.3 => /workspace/intel/oneapi/mkl/202x.y.z/lib/intel64/libmkl_sycl.so.3 (0x00007fd55512c000)
+        libOpenCL.so.1 => /workspace/intel/oneapi/compiler/202x.y.z/linux/lib/libOpenCL.so.1 (0x00007fd55511d000)
+        libsvml.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libsvml.so (0x00007fd553b11000)
+        libirng.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libirng.so (0x00007fd553600000)
+        libimf.so => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libimf.so (0x00007fd55321b000)
+        libintlc.so.5 => /workspace/intel/oneapi/compiler/202x.y.z/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007fd553a9c000)
+        libsycl.so.6 => /workspace/intel/oneapi/compiler/202x.y.z/linux/lib/libsycl.so.6 (0x00007fd552f36000)
+        ...
+
+
+
+
+

Use SYCL code

+

Using SYCL code in a C++ application is also possible. The example below shows how to invoke SYCL code. You need to explicitly pass -fsycl into CMAKE_CXX_FLAGS.

+

example-usm.cpp

+
#include <iostream>
+#include <memory>
+#include <torch/script.h>
+#include <c10/xpu/XPUStream.h>
+#include <ATen/ATen.h>
+#include <CL/sycl.hpp>
+
+using namespace sycl;
+
+int main(int argc, const char* argv[]) {
+  torch::jit::script::Module module;
+  try {
+    module = torch::jit::load(argv[1]);
+  }
+  catch (const c10::Error& e) {
+    std::cerr << "error loading the model\n";
+    return -1;
+  }
+  std::cout << "load model done " << std::endl;
+  module.to(at::kXPU);
+
+  std::vector<torch::jit::IValue> inputs;
+  c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+  auto options = at::TensorOptions().dtype(at::kFloat).device(stream.device());
+  float *input_ptr = malloc_device<float>(224 * 224 * 3, stream);
+  auto input = torch::from_blob(
+      input_ptr,
+      {1, 3, 224, 224},
+      options);
+  std::cout << "input tensor created from usm " << std::endl;
+  inputs.push_back(input);
+
+  at::IValue output = module.forward(inputs);
+  torch::Tensor output_tensor;
+  output_tensor = output.toTensor();
+  std::cout << output_tensor.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
+  std::cout << "Execution finished" << std::endl;
+
+  return 0;
+}
+
+
+

CMakeLists.txt

+
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
+project(example-usm)
+
+find_package(IPEX REQUIRED)
+
+set(target example-usm)
+add_executable(${target} example-usm.cpp)
+target_link_libraries(${target} ${TORCH_IPEX_LIBRARIES})
+list(APPEND CMAKE_CXX_FLAGS "-fsycl")
+
+set_property(TARGET ${target} PROPERTY CXX_STANDARD 17)
+
+
+
+
+

Customize DPC++ kernels

+

Intel® Extension for PyTorch* provides its C++ dynamic library to allow users to implement custom DPC++ kernels to run on the XPU device. Refer to the DPC++ extension for details.

+
+
+
+

Intel® AI Reference Models

+

Use cases that have already been optimized by Intel engineers are available at Intel® AI Reference Models (formerly Model Zoo). A number of PyTorch use cases for benchmarking are also available in the Use Cases section. Models verified on Intel GPUs are marked in the Model Documentation column. You can get performance benefits out-of-the-box by simply running the scripts in the Intel® AI Reference Models.

+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features.html b/xpu/2.3.110+xpu/tutorials/features.html new file mode 100644 index 000000000..ad66c257b --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features.html @@ -0,0 +1,293 @@ + + + + + + + Features — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Features

+
+

Easy-to-use Python API

+

Intel® Extension for PyTorch* provides simple frontend Python APIs and utilities to get performance optimizations such as operator optimization.

+

Check the API Documentation for descriptions of the API functions and Examples for usage guidance.
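As a minimal sketch mirroring the Examples page, the frontend API for inference boils down to moving the model and data to the XPU device and calling ipex.optimize:

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

model = model.to("xpu")
data = data.to("xpu")
model = ipex.optimize(model)

with torch.no_grad():
    model(data)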

+
+
+

Channels Last

+

Compared with the default NCHW memory format, using channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch*, NHWC memory format has been enabled for most key CPU and GPU operators. More detailed information is available at Channels Last.

+

Intel® Extension for PyTorch* automatically converts a model to channels last memory format when users optimize the model with ipex.optimize(model). With this feature, users do not need to manually apply model=model.to(memory_format=torch.channels_last) anymore. However, models running on Intel® Data Center GPU Flex Series will choose oneDNN layout, so users still need to manually convert the model and data to channels last format. More detailed information is available at Auto Channels Last.
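A minimal sketch of this manual conversion is shown below; the ResNet-50 model and the random input are only illustrative.

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

# Manual channels last conversion, only needed where auto channels last
# does not apply (e.g. on Intel® Data Center GPU Flex Series, as noted above).
model = model.to("xpu").to(memory_format=torch.channels_last)
data = data.to("xpu").to(memory_format=torch.channels_last)
model = ipex.optimize(model)

with torch.no_grad():
    model(data)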

+
+
+
+
+

Auto Mixed Precision (AMP)

+

Benefiting from less memory usage and computation, low precision data types typically speed up both training and inference workloads. On the GPU side, support for both BFloat16 and Float16 is available in Intel® Extension for PyTorch*. BFloat16 is the default low precision floating data type when AMP is enabled.
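A minimal AMP sketch on an XPU device is shown below; the tiny linear model is only illustrative, and complete training and inference scripts are in Examples.

import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

model = nn.Linear(4, 5)
model.eval()
data = torch.rand(2, 4)

model = model.to("xpu")
data = data.to("xpu")
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad():
    # BFloat16 is the default low precision dtype when AMP is enabled.
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        model(data)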

+

Detailed information about AMP for GPU is available at Auto Mixed Precision (AMP) on GPU.

+
+
+
+
+

Quantization

+

Intel® Extension for PyTorch* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs.

+

Check more detailed information for INT8 Quantization.

+

On Intel® GPUs, Intel® Extension for PyTorch* also provides FP8 Quantization. Check more detailed information for FP8 Quantization.

+
+
+
+
+

Distributed Training

+

To meet demands of large scale model training over multiple devices, distributed training on Intel® GPUs is supported. Two alternative methodologies are available. Users can choose either to use PyTorch native distributed training module, Distributed Data Parallel (DDP), with Intel® oneAPI Collective Communications Library (oneCCL) support via Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl) or use Horovod with Intel® oneAPI Collective Communications Library (oneCCL) support (Prototype).

+

For more detailed information, check DDP and Horovod (Prototype).

+
+
+
+
+

DLPack Solution

+

DLPack defines a stable in-memory data structure for sharing tensors among frameworks. It enables sharing of tensor data without copying when interoperating with other libraries. Intel® Extension for PyTorch* extends DLPack support in PyTorch* for the XPU device in particular.

+

For more detailed information, check DLPack Solution.

+
+
+
+
+

DPC++ Extension

+

Intel® Extension for PyTorch* provides C++ APIs to get SYCL queue and configure floating-point math mode.

+

Check the API Documentation for the details of API functions. DPC++ Extension describes how to write customized DPC++ kernels with a practical example and build it with setuptools and CMake.

+
+
+
+
+

Advanced Configuration

+

The default settings for Intel® Extension for PyTorch* are sufficient for most use cases. However, if you need to customize Intel® Extension for PyTorch*, advanced configuration is available at build time and runtime.

+

For more detailed information, check Advanced Configuration.

+

A driver environment variable ZE_FLAT_DEVICE_HIERARCHY is currently used to select the device hierarchy model with which the underlying hardware is exposed. By default, each GPU tile is used as a device. Check the Level Zero Specification Documentation for more details.
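As a minimal sketch, the variable can be set before the extension is imported; the COMPOSITE value used here, and the FLAT/COMPOSITE semantics described in the comments, are taken from the Level Zero specification and should be double-checked there.

import os

# Assumption: FLAT exposes each GPU tile as a separate device (the default
# behavior described above), while COMPOSITE exposes the whole card; see the
# Level Zero Specification Documentation for the authoritative list of values.
os.environ["ZE_FLAT_DEVICE_HIERARCHY"] = "COMPOSITE"

import torch
import intel_extension_for_pytorch  # noqa: F401

print(torch.xpu.device_count())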

+
+
+
+
+

Fully Sharded Data Parallel (FSDP)

+

Fully Sharded Data Parallel (FSDP) is a PyTorch* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states, and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible.

+

For more detailed information, check FSDP.

+
+
+
+
+

torch.compile for GPU (Beta)

+

Intel® Extension for PyTorch* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship torch.compile API through the default “inductor” backend (TorchInductor).
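A minimal sketch is shown below; check the torch.compile for GPU documentation for the full recommended workflow and any additional setup it may require.

import torch
import torchvision.models as models
import intel_extension_for_pytorch  # noqa: F401

model = models.resnet50(weights="ResNet50_Weights.DEFAULT")
model.eval()
data = torch.rand(1, 3, 224, 224)

model = model.to("xpu")
data = data.to("xpu")

# Graph compilation through the default "inductor" backend.
compiled_model = torch.compile(model)

with torch.no_grad():
    compiled_model(data)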

+

For more detailed information, check torch.compile for GPU.

+
+
+
+
+

Kineto Supported Profiler Tool (Prototype)

+

The Kineto supported profiler tool is an extension of the PyTorch* profiler for profiling the execution time of operators on GPU devices. With this tool, you can get profiling information on many aspects of the models or code scripts you run. Intel® Extension for PyTorch* is built with Kineto support by default; enable this tool by wrapping the code segment of interest in a with statement, as sketched below.
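The following is a minimal sketch of such a with statement. It assumes that the ProfilerActivity.XPU activity described in the Profiler Kineto documentation is available in your build; consult that documentation for the exact supported options.

import torch
import intel_extension_for_pytorch  # noqa: F401

x = torch.randn(1024, 1024, device="xpu")

# Assumption: ProfilerActivity.XPU is registered by a Kineto-enabled build.
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.XPU,
    ]
) as prof:
    y = torch.mm(x, x)

print(prof.key_averages().table())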

+

For more detailed information, check Profiler Kineto.

+
+
+
+
+

Compute Engine (Prototype feature for debug)

+

Compute engine is a prototype feature which provides the capability to choose a specific backend for operators that have multiple implementations.

+

For more detailed information, check Compute Engine.

+
+
+
+
+

IPEX_LOG (Prototype feature for debug)

+

IPEX_LOG provides the capability to log verbose information from Intel® Extension for PyTorch*. Use IPEX_LOG to get log information or trace the execution of Intel® Extension for PyTorch*, and continue using PyTorch* macros such as TORCH_CHECK, TORCH_ERROR, etc. to get log information from PyTorch*.

+

For more detailed information, check IPEX_LOG.

+
+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/DDP.html b/xpu/2.3.110+xpu/tutorials/features/DDP.html new file mode 100644 index 000000000..c38aaa67f --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/DDP.html @@ -0,0 +1,410 @@ + + + + + + + DistributedDataParallel (DDP) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

DistributedDataParallel (DDP)

+
+

Introduction

+

DistributedDataParallel (DDP) is a PyTorch* module that implements multi-process data parallelism across multiple GPUs and machines. With DDP, the model is replicated on every process, and each model replica is fed a different set of input data samples. Please refer to DDP Tutorial for an introduction to DDP.

+

The PyTorch Collective Communication (c10d) library supports communication across processes. To run DDP on GPU, we use Intel® oneCCL Bindings for Pytorch* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library* (oneCCL), a library for efficient distributed deep learning training implementing such collectives as allreduce, allgather, and alltoall. Refer to oneCCL Github page for more information about oneCCL.

+
+
+

Installation of Intel® oneCCL Bindings for Pytorch*

+

To use PyTorch DDP on GPU, install Intel® oneCCL Bindings for Pytorch* as described below.

+
+

Install PyTorch and Intel® Extension for PyTorch*

+

Make sure you have installed PyTorch and Intel® Extension for PyTorch* successfully. +For more detailed information, check Installation Guide.

+
+
+

Install Intel® oneCCL Bindings for Pytorch*

+ +
+

Install from source

+

Refer to Installation Guide to install Intel® oneCCL Bindings for Pytorch* from source.

+
+
+
+

Runtime Dynamic Linking

+
    +
  • Dynamically link oneCCL from the oneAPI basekit:

  • +
+
source <ONEAPI_ROOT>/ccl/latest/env/vars.sh
+source <ONEAPI_ROOT>/mpi/latest/env/vars.sh
+
+
+

Note: Make sure you have installed basekit when using Intel® oneCCL Bindings for Pytorch* on Intel® GPUs. If the basekit is installed with a package manager, <ONEAPI_ROOT> is /opt/intel/oneapi.

+
+
+
+

DDP Usage

+

DDP follows its usage in PyTorch. To use DDP with Intel® Extension for PyTorch*, make the following modifications to your model script:

+
    +
  • Import the necessary packages.

  • +
+
import torch
+import intel_extension_for_pytorch 
+import oneccl_bindings_for_pytorch
+
+
+
    +
  • Initialize the process group with ccl backend.

  • +
+
dist.init_process_group(backend='ccl')
+
+
+
    +
  • For DDP where each process exclusively works on a single GPU, set the device ID to the local rank. This step is not required for usage on CPU.

  • +
+
device = "xpu:{}".format(args.local_rank)
+torch.xpu.set_device(device)
+
+
+
    +
  • Wrap model by DDP.

  • +
+
model = model.to(device)
+model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[device])
+
+
+

Note: For single-device modules, device_ids can contain exactly one device id, which represents the only GPU device where the input module corresponding to this process resides. Alternatively, device_ids can be None.

+

Note: When using torch.xpu.optimize for distributed training with low precision, the torch.xpu.manual_seed(seed_number) is needed to make sure the master weight is the same on all ranks.

+
+
+

Example Usage (MPI launch for single node):

+

Intel® oneCCL Bindings for Pytorch* recommends MPI as the launcher to start multiple processes. Here’s an example to illustrate such usage.

+

Dynamically link the oneCCL and Intel MPI libraries:

+
source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
+# Or
+source <ONEAPI_ROOT>/ccl/latest/env/vars.sh
+source <ONEAPI_ROOT>/mpi/latest/env/vars.sh
+
+
+

Example_DDP.py

+
"""
+This example shows how to use MPI as the launcher to start DDP on single node with multiple devices.
+"""
+import os
+import torch
+import torch.nn as nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+import torch.distributed as dist
+import intel_extension_for_pytorch
+import oneccl_bindings_for_pytorch
+
+
+class Model(nn.Module):
+    def __init__(self):
+        super(Model, self).__init__()
+        self.linear = nn.Linear(4, 5)
+
+    def forward(self, input):
+        return self.linear(input)
+
+
+if __name__ == "__main__":
+
+    torch.xpu.manual_seed(123)  # set a seed number
+    mpi_world_size = int(os.environ.get('PMI_SIZE', -1))
+    mpi_rank = int(os.environ.get('PMI_RANK', -1))
+    if mpi_world_size > 0:
+        os.environ['RANK'] = str(mpi_rank)
+        os.environ['WORLD_SIZE'] = str(mpi_world_size)
+    else:
+        # set the default rank and world size to 0 and 1
+        os.environ['RANK'] = str(os.environ.get('RANK', 0))
+        os.environ['WORLD_SIZE'] = str(os.environ.get('WORLD_SIZE', 1))
+    os.environ['MASTER_ADDR'] = '127.0.0.1'  # your master address
+    os.environ['MASTER_PORT'] = '29500'  # your master port
+
+    # Initialize the process group with ccl backend
+    dist.init_process_group(backend='ccl')
+
+    # For single-node distributed training, local_rank is the same as global rank
+    local_rank = dist.get_rank()
+    # Only set device for distributed training on GPU
+    device = "xpu:{}".format(local_rank)
+    model = Model().to(device)
+    if dist.get_world_size() > 1:
+        model = DDP(model, device_ids=[device])
+
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+    loss_fn = nn.MSELoss().to(device)
+    for i in range(3):
+        print("Runing Iteration: {} on device {}".format(i, device))
+        input = torch.randn(2, 4).to(device)
+        labels = torch.randn(2, 5).to(device)
+        # forward
+        print("Runing forward: {} on device {}".format(i, device))
+        res = model(input)
+        # loss
+        print("Runing loss: {} on device {}".format(i, device))
+        L = loss_fn(res, labels)
+        # backward
+        print("Runing backward: {} on device {}".format(i, device))
+        L.backward()
+        # update
+        print("Runing optim: {} on device {}".format(i, device))
+        optimizer.step()
+
+
+

Running command:

+
mpirun -n 2 -l python Example_DDP.py
+
+
+
+
+

DDP scaling API (GPU Only)

+

When using one GPU card with multiple tiles, each tile can be regarded as a device for explicit scaling. We provide a DDP scaling API to enable DDP on one GPU card in the GitHub repo.

+
+

Usage of DDP scaling API

+

Note: This API supports GPU devices on one card.

+
Args:
+model: model to be parallelized
+train_dataset: dataset for training
+
+
+

If you have a model running on a single tile, you only need to make minor changes to enable the DDP training by following these steps:

+
    +
  • Import the API:

  • +
+
try:
+    from intel_extension_for_pytorch.xpu.single_card import single_card_dist
+except ImportError:
+    raise ImportError("single_card_dist not available!")
+
+
+
    +
  • Use multi_process_spawn launcher as a torch.multiprocessing wrapper.

  • +
+
single_card_dist.multi_process_spawn(main_worker, (args, )) # put arguments of main_worker into a tuple
+
+
+
    +
  • Usage of this API:

  • +
+
dist = single_card_dist(model, train_dataset)
+local_rank, model, train_sampler = dist.rank, dist.model, dist.train_sampler
+
+
+
    +
  • Set in the model training:

  • +
+
for epoch in range(args.epochs):  # args.epochs: total number of training epochs (illustrative argument name)
+    train_sampler.set_epoch(epoch)
+
+
+
    +
  • Adjust the model to call local_rank, model, and train_sampler as shown here:

  • +
+
    +
  • device: get the xpu information used in model training

  • +
+
xpu = "xpu:{}".format(local_rank)
+print("DDP Use XPU: {} for training".format(xpu))
+
+
+
    +
  • model: use the model wrapped by DDP in the subsequent training

  • +
  • train_sampler: use the train_sampler to get the train_loader

  • +
+
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
+    num_workers=args.workers, pin_memory=True, sampler=train_sampler)
+
+
+

Then you can start your model training on multiple GPU devices of one card.

+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + diff --git a/xpu/2.3.110+xpu/tutorials/features/DLPack.html b/xpu/2.3.110+xpu/tutorials/features/DLPack.html new file mode 100644 index 000000000..792021d34 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/DLPack.html @@ -0,0 +1,241 @@ + + + + + + + DLPack Solution — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

DLPack Solution

+
+

Introduction

+

DLPack defines a stable in-memory data structure for sharing tensors among frameworks. It is a solution with wide community adoption and supports NumPy, PyTorch, and other popular frameworks in the deep learning domain. Intel® Extension for PyTorch* extends DLPack support in PyTorch for the XPU backend in particular, in order to share tensor data without copies when interoperating with other libraries via the DLPack solution. The currently supported DLPack version is v0.7.

+
+
+

Use Case

+

The following use case demonstrates two typical DLPack usages related to Intel® Extension for PyTorch*. One is to import an external tensor into Intel® Extension for PyTorch*: the tensor from an external library is packed in a DLPack capsule, then converted to a PyTorch tensor on XPU so that it is operable in Intel® Extension for PyTorch*. The other is to export a PyTorch tensor on XPU to an external library: the PyTorch tensor on XPU is packed in a DLPack capsule, so that the external library can operate on this shared tensor via the DLPack solution.

+
import intel_extension_for_pytorch
+import torch.utils.dlpack
+
+# create DLPack capsule from external
+capsule = ...
+
+# Usage 1: convert DLPack capsule to PyTorch tensor on XPU
+t = torch.from_dlpack(capsule)
+
+# create PyTorch tensor on XPU
+t2 = torch.empty([10], device='xpu')
+
+# Usage 2: convert PyTorch tensor on XPU to DLPack capsule
+capsule2 = torch.to_dlpack(t2)
+
+
+
+
+

Design

+

When importing an external tensor in DLManagedTensor format, a PyTorch tensor is created, and the required information such as dim, sizes, and strides is parsed and extracted from the external tensor into the PyTorch tensor by stock PyTorch. The data_ptr points to the original memory allocation, so a data copy is not required. Here Intel® Extension for PyTorch* is responsible for converting the device type and id from DLDevice to the ATen device for the XPU backend.

+
+

Import DLPack Capsule

+

Figure 1: DLPack import

+

When exporting a PyTorch tensor, an ATenDLMTensor is created whose handle points to the original PyTorch tensor and whose tensor field contains the exported tensor in DLManagedTensor format. The required information such as ndim, shape, and strides is parsed and extracted from the PyTorch tensor into the external tensor. The data pointer points to the original memory allocation, so a data copy is not required. Here Intel® Extension for PyTorch* is responsible for converting the device type and id from the ATen device to DLDevice for the XPU backend.

+
+
+

Export DLPack Capsule

+

Figure 2: DLPack export

+

Note: The DLManagedTensor format used in the above figures is from https://dmlc.github.io/dlpack/latest/python_spec.html.

+
+
+

DLDevice and data pointer

+

The DLDeviceType in DLDevice is kDLOneAPI for sharing memory between Intel® Extension for PyTorch* and other libraries. It is not kDLSycl, since it relies on the oneAPI SYCL extensions filter_selector and default platform context to operate. The device_id in DLDevice is one of the SYCL runtime device ids, which may be different from the actual framework device in use. When producing a DLPack capsule, the DPC++ runtime gets the device where the memory allocation was originally made. If that device has a parent device, we find the parent device's index as enumerated in sycl::device::get_devices() and put it in device_id.

+

The data pointer points to the shared data to be accessed by the consumer via DLPack. Only USM allocations are valid in the data pointer when DLDeviceType is kDLOneAPI. The SYCL 2020 Specification defines three types of USM allocations: sycl::usm::host, sycl::usm::device, and sycl::usm::shared; sycl::usm::device is the only supported type. Also, sycl::usm::device allocations are valid in DLPack only when the memory allocation was made under the default SYCL context of the SYCL platform.
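As a small illustrative check, the device information that would be exported can be inspected through the standard DLPack Python protocol on tensors; this assumes the __dlpack_device__ protocol method is available for XPU tensors in your installation.

import torch
import intel_extension_for_pytorch  # noqa: F401

t = torch.empty(10, device="xpu")

# __dlpack_device__() returns (device_type, device_id); for XPU tensors the
# device_type corresponds to kDLOneAPI, as described above (assumption).
print(t.__dlpack_device__())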

+
+
+
+

Asynchronous Programming

+

So far, DLPack defines how the producer shares memory allocations in DLPack capsule format and how the consumer recognizes the shared memory allocations. It does not define a synchronization method between producer and consumer that lets both sides know when it is safe to access the data in the shared memory allocations. Since the producer and the consumer likely have different implementations for supporting asynchronous programming, it is hard to define a general solution covering all scenarios. It is up to the consumer to monitor the execution flow of Intel® Extension for PyTorch* and find out when the data is ready to use.

+

The following example shows one possible solution for the consumer to safely use USM allocations from Intel® Extension for PyTorch*.

+
+

Example Case

+
import intel_extension_for_pytorch
+import torch.utils.dlpack
+
+# Get shared tensor from Intel® Extension for PyTorch* via DLPack
+t = torch.from_dlpack(capsule)
+
+# Wait for the data ready to use
+torch.xpu.synchronize()
+
+# Use the data in shared tensor
+...
+
+
+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/DPC++_Extension.html b/xpu/2.3.110+xpu/tutorials/features/DPC++_Extension.html new file mode 100644 index 000000000..81acd72eb --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/DPC++_Extension.html @@ -0,0 +1,653 @@ + + + + + + + DPC++ Extension — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

DPC++ Extension

+
+

Introduction

+

C++ extension is a mechanism developed by PyTorch that lets you create customized and highly efficient PyTorch operators defined out-of-source, i.e. separate from the PyTorch backend. (For more details, see https://pytorch.org/tutorials/advanced/cpp_extension.html). Based on the PyTorch C++ extension mechanism, Intel® Extension for PyTorch* lets you create PyTorch operators with custom DPC++ kernels to run on the XPU device.

+

Note: The current implementation of the DPC++ extension only supports Linux.

+
+
+

Motivation and Example

+

This tutorial walks through a practical example of writing and using a DPC++ extension on the XPU device with Intel® Extension for PyTorch*.

+
+
+

Writing a DPC++ Extension

+

DPC++ extensions come in two flavors: They can be built “ahead of time” (AOT) with setuptools, or “just in time” (JIT) via torch.xpu.cpp_extension.load(). We’ll begin with the first approach and discuss the latter one afterwards.

+

In addition, DPC++ extensions also support compilation with CMake. We’ll discuss the CMake methodology last.

+
+

Building with setuptools

+

For building with setuptools, we build our DPC++ extension by writing a setup.py script that uses setuptools to compile our C++ code. For the Long-Long-Term-Memory unit (LLTM), it looks like this:

+
from setuptools import setup
+import torch
+import intel_extension_for_pytorch
+from torch.xpu.cpp_extension import DPCPPExtension, DpcppBuildExtension
+
+setup(
+    name='lltm',
+    ext_modules=[
+        DPCPPExtension('lltm_xpu', [
+            'lltm_xpu.cpp',
+            'lltm_xpu_kernel.cpp',
+        ])
+    ],
+    cmdclass={
+        'build_ext': DpcppBuildExtension
+    })
+
+
+

In this code, DPCPPExtension is a convenience wrapper around setuptools.Extension that passes the correct include paths and sets the language of the extension to C++. The equivalent vanilla setuptools code would simply be:

+
Extension(
+   name='lltm_xpu',
+   sources=['lltm_xpu.cpp', 'lltm_xpu_kernel.cpp',],
+   include_dirs=cpp_extension.include_paths(),
+   language='c++')
+
+
+

DpcppBuildExtension performs a number of required configuration steps and checks and also manages compilation in the case of DPC++ extensions. And that’s all we really need to know about building DPC++ extensions for now.

+

Let’s take a look at the implementation of our DPC++ extension, which goes into lltm_xpu.cpp and lltm_xpu_kernel.cpp. After building the Python module with the DPC++ extension, lltm_xpu is available for importing as an extension plug-in.

+
import lltm_xpu
+
+
+
+
+

JIT Compiling Extensions

+

Previously, we mentioned that there were two ways of building DPC++ extensions: use setuptools as AOT or compile with JIT. Having the former one introduced, let’s elaborate on the latter one. The JIT compilation mechanism provides a methodology to compile and load your extensions on the fly by invoking a simple torch API function torch.xpu.cpp_extension.load(). For the LLTM, this would look as simple as this:

+
import torch
+import intel_extension_for_pytorch
+from torch.xpu.cpp_extension import load
+
+lltm_xpu = load(name="lltm_xpu", sources=['lltm_xpu.cpp', 'lltm_xpu_kernel.cpp',])
+
+
+

Here, we provide the load function with the same information as we did for setuptools. In the background, the function will do the following:

+
    +
  1. Create a temporary directory /tmp/torch_extensions/py[ver]_xpu/lltm_xpu,

  2. +
  3. Emit a Ninja build file into that temporary directory,

  4. +
  5. Compile your source files into a shared library,

  6. +
  7. Import this shared library as a Python module.

  8. +
+

In fact, if you pass verbose=True to cpp_extension.load(), you will be informed about the process:

+
Emitting ninja build file /home/[user_name]/.cache/torch_extensions/py[ver]_xpu/lltm_xpu/build.ninja...
+Building extension module lltm_xpu...
+Loading extension module lltm_xpu...
+
+
+

The resulting Python module is exactly the same as the one produced by setuptools, and this avoids maintaining a separate setup.py build file. Generally this JIT technique does the compilation just fine; however, if your setup is more complicated and you do need the full power of setuptools, you can still write your own setup.py. The first time you run through this line it will take some time, as the extension compiles in the background. Since we use the Ninja build system to build the source code, re-compilation is incremental, so reloading the extension when you run your Python module a second time is fast and has low overhead if there are no code changes in the extension’s source files.

+
+
+

Building with CMake

+

For building with CMake, we build our DPC++ extension by writing a CMakeLists.txt file that uses CMake to build our C++ code. For the same example we showed using setuptools, the CMakeLists.txt looks like this: +CMakeLists.txt

+
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
+project(lltm_xpu)
+
+find_package(Python COMPONENTS Interpreter Development)
+find_package(Torch REQUIRED)
+find_package(IPEX REQUIRED)
+
+#The SYCL kernel should be compiled with "-fsycl"
+set_source_files_properties(lltm_xpu_kernel.cpp PROPERTIES COMPILE_FLAGS "-fsycl")
+
+add_library(lltm_xpu SHARED lltm_xpu.cpp lltm_xpu_kernel.cpp)
+target_link_libraries(lltm_xpu "${TORCH_LIBRARIES}")
+target_link_libraries(lltm_xpu "${TORCH_IPEX_LIBRARIES}")
+target_include_directories(lltm_xpu PUBLIC "${Python_INCLUDE_DIRS}")
+target_include_directories(lltm_xpu PUBLIC "${TORCH_IPEX_INCLUDE_DIRS}")
+
+set_property(TARGET lltm_xpu PROPERTY CXX_STANDARD 17)
+#DPCPP need 17
+
+
+

Find the cmake_prefix_path of torch and ipex:

+
$ python
+>>> import torch
+>>> import intel_extension_for_pytorch
+>>> torch.utils.cmake_prefix_path
+'<cmake_prefix_path for torch>'
+>>> intel_extension_for_pytorch.cmake_prefix_path
+'<cmake_prefix_path for ipex>'
+
+
+

Commands for compilation:

+
$ cmake -DCMAKE_PREFIX_PATH=<torch & ipex cmake_prefix_path> -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=<icpx|icx> ..
+$ make
+
+
+

After building the Python module with CMake, lltm_xpu is also available for import as an extension plug-in, just like with the setuptools method.

+
$ python
+>>> import torch
+>>> import intel_extension_for_pytorch
+>>> import lltm_xpu
+
+
+
+
+

Requesting the current c10::xpu::XPUStream

+

If you need to get the current c10::xpu::XPUStream on the current XPU device to do synchronization, you can implement it as below.

+
c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+stream.synchronize();
+
+
+
+
+

Fetching the corresponding sycl::queue

+

We provide some APIs to fetch the corresponding sycl::queue associated with the +current c10::xpu::XPUStream. +In C++ code, you can fetch a sycl::queue reference as below.

+
c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+auto& queue = stream.queue();
+
+
+

In Python code, you can use the code below to get a void*, which can be cast to a sycl::queue pointer.

+
import torch
+import intel_extension_for_pytorch
+stream = torch.xpu.current_stream()
+queue = stream.sycl_queue # queue is a ``void*`` which can cast to a sycl::queue pointer
+
+
+

Subsequently, you can submit a customized kernel via the sycl::queue yourself. Refer to Writing the DPC++ Op for more details.

+
+
+

Writing the DPC++ Op

+

The general strategy for writing a DPC++ extension is to write a C++ file that defines the functions that are called from Python, and binds those functions to Python with pybind11. The C++ functions do some checks and ultimately forward the calls to submit SYCL kernels. The ipex.cpp_extension package then takes care of compiling the C++ sources with a DPC++ compiler.

+

Let’s consider the PyTorch mixed C++/CUDA extension example at https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-mixed-c-cuda-extension. Here is how we implement it in DPC++ style:

+
#include <torch/extension.h>
+
+#include <vector>
+
+// XPU forward declarations
+
+std::vector<torch::Tensor> lltm_xpu_forward(
+    torch::Tensor input,
+    torch::Tensor weights,
+    torch::Tensor bias,
+    torch::Tensor old_h,
+    torch::Tensor old_cell);
+
+std::vector<torch::Tensor> lltm_xpu_backward(
+    torch::Tensor grad_h,
+    torch::Tensor grad_cell,
+    torch::Tensor new_cell,
+    torch::Tensor input_gate,
+    torch::Tensor output_gate,
+    torch::Tensor candidate_cell,
+    torch::Tensor X,
+    torch::Tensor gate_weights,
+    torch::Tensor weights);
+
+// C++ interface
+
+#define CHECK_XPU(x) TORCH_CHECK(x.device().is_xpu(), #x " must be a XPU tensor")
+#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
+#define CHECK_INPUT(x) CHECK_XPU(x); CHECK_CONTIGUOUS(x)
+
+std::vector<torch::Tensor> lltm_forward(
+    torch::Tensor input,
+    torch::Tensor weights,
+    torch::Tensor bias,
+    torch::Tensor old_h,
+    torch::Tensor old_cell) {
+  CHECK_INPUT(input);
+  CHECK_INPUT(weights);
+  CHECK_INPUT(bias);
+  CHECK_INPUT(old_h);
+  CHECK_INPUT(old_cell);
+
+  return lltm_xpu_forward(input, weights, bias, old_h, old_cell);
+}
+
+std::vector<torch::Tensor> lltm_backward(
+    torch::Tensor grad_h,
+    torch::Tensor grad_cell,
+    torch::Tensor new_cell,
+    torch::Tensor input_gate,
+    torch::Tensor output_gate,
+    torch::Tensor candidate_cell,
+    torch::Tensor X,
+    torch::Tensor gate_weights,
+    torch::Tensor weights) {
+  CHECK_INPUT(grad_h);
+  CHECK_INPUT(grad_cell);
+  CHECK_INPUT(input_gate);
+  CHECK_INPUT(output_gate);
+  CHECK_INPUT(candidate_cell);
+  CHECK_INPUT(X);
+  CHECK_INPUT(gate_weights);
+  CHECK_INPUT(weights);
+
+  return lltm_xpu_backward(
+      grad_h,
+      grad_cell,
+      new_cell,
+      input_gate,
+      output_gate,
+      candidate_cell,
+      X,
+      gate_weights,
+      weights);
+}
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
+  m.def("forward", &lltm_forward, "LLTM forward (XPU)");
+  m.def("backward", &lltm_backward, "LLTM backward (XPU)");
+}
+
+
+

The bridge code checks and forwards the calls to functions that we’ll define in the DPC++ code file lltm_xpu_kernel.cpp. DPC++ compiles standard C++ naturally, so we still have ATen and the C++ standard library available to us.
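Once the module is built and imported, the bound functions can be called from Python with XPU tensors. The snippet below is a minimal usage sketch; the concrete sizes are illustrative assumptions, while the argument order and the 3 * state_size fused-gate layout follow the functions defined in this tutorial.

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the xpu device)
import lltm_xpu  # the extension built above

# Illustrative sizes; the 3 * state_size fused-gate layout follows lltm_xpu_forward.
batch_size, input_features, state_size = 16, 32, 128

# All inputs must be contiguous XPU tensors (enforced by CHECK_INPUT above).
X = torch.randn(batch_size, input_features, device="xpu")
h = torch.randn(batch_size, state_size, device="xpu")
C = torch.randn(batch_size, state_size, device="xpu")
W = torch.randn(3 * state_size, input_features + state_size, device="xpu")
b = torch.randn(1, 3 * state_size, device="xpu")

new_h, new_cell, *saved_for_backward = lltm_xpu.forward(X, W, b, h, C)
print(new_h.shape)  # torch.Size([16, 128])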

+

Let’s go through the DPC++ code step by step:

+
#include <torch/extension.h>
+
+#include <ipex.h>
+
+#include <vector>
+
+template <typename scalar_t>
+scalar_t sigmoid(scalar_t z) {
+  return 1.0f / (1.0f + exp(-z));
+}
+
+
+

At the beginning of the code, we include <torch/extension.h>, which introduces all the torch definitions into the code. After that, the <ipex.h> line includes the SYCL header used by DPC++. With <torch/extension.h> and <ipex.h>, all the essential declarations are included for writing a DPC++ kernel that runs on the XPU device. The helper function sigmoid performs its math directly in C++ for efficiency. Next are some more helper functions for the LLTM:

+
template <typename scalar_t>
+scalar_t d_sigmoid(scalar_t z) {
+  const auto s = sigmoid(z);
+  return (1.0f - s) * s;
+}
+
+template <typename scalar_t>
+scalar_t d_tanh(scalar_t z) {
+  const auto t = tanh(z);
+  return 1.0f - (t * t);
+}
+
+template <typename scalar_t>
+scalar_t elu(scalar_t z, scalar_t alpha = 1.0f) {
+  return fmax(0.0f, z) + fmin(0.0f, alpha * (exp(z) - 1.0f));
+}
+
+template <typename scalar_t>
+scalar_t d_elu(scalar_t z, scalar_t alpha = 1.0f) {
+  const auto e = exp(z);
+  const auto d_relu = z < 0.0f ? 0.0f : 1.0f;
+  return d_relu + (((alpha * (e - 1.0f)) < 0.0f) ? (alpha * e) : 0.0f);
+}
+
+
+

Now we can implement the actual code for our extension with two functions in DPC++:

+
  • a function that performs operations we don’t wish to explicitly write by hand and calls into the function that submits the SYCL kernel,
  • a function that actually submits the SYCL kernel to the XPU device for the parts we want to speed up.
+

For the forward pass, the first function looks like this:

+
std::vector<torch::Tensor> lltm_xpu_forward(
+        torch::Tensor input,
+        torch::Tensor weights,
+        torch::Tensor bias,
+        torch::Tensor old_h,
+        torch::Tensor old_cell) {
+  auto X = torch::cat({old_h, input}, /*dim=*/1);
+  auto gates = torch::addmm(bias, X, weights.transpose(0, 1));
+
+  const auto batch_size = old_cell.size(0);
+  const auto state_size = old_cell.size(1);
+
+  auto new_h = torch::zeros_like(old_cell);
+  auto new_cell = torch::zeros_like(old_cell);
+  auto input_gate = torch::zeros_like(old_cell);
+  auto output_gate = torch::zeros_like(old_cell);
+  auto candidate_cell = torch::zeros_like(old_cell);
+
+  AT_DISPATCH_FLOATING_TYPES(gates.type(), "lltm_forward_xpu", ([&] {
+    lltm_xpu_forward_kernel<scalar_t>(
+          gates.data<scalar_t>(),
+                  old_cell.data<scalar_t>(),
+                  new_h.data<scalar_t>(),
+                  new_cell.data<scalar_t>(),
+                  input_gate.data<scalar_t>(),
+                  output_gate.data<scalar_t>(),
+                  candidate_cell.data<scalar_t>(),
+                  state_size,
+                  batch_size);
+  }));
+
+  return {new_h, new_cell, input_gate, output_gate, candidate_cell, X, gates};
+}
+
+
+

The purpose of AT_DISPATCH_FLOATING_TYPES is to take care of the dtype dispatch for us. It takes a type (gates.type() in our case), a name (for error messages) and a lambda function. Inside this lambda function, the type alias scalar_t is available and is defined as the type that the tensor actually is at runtime in that context. As such, if we have a template function (which will submit the actual SYCL kernel), we can instantiate it with this scalar_t alias, and the correct function will be called. In this case, we also want to retrieve the data pointers of the tensors as pointers of that scalar_t type. If you want to dispatch over all types and not just floating point types (Float and Double), you can use AT_DISPATCH_ALL_TYPES.

+

Here’s how to submit the actual kernel to the XPU device:

+
template <typename scalar_t>
+void lltm_xpu_forward_kernel(
+        const scalar_t* gates,
+        const scalar_t* old_cell,
+        scalar_t* new_h,
+        scalar_t* new_cell,
+        scalar_t* input_gate,
+        scalar_t* output_gate,
+        scalar_t* candidate_cell,
+        size_t state_size,
+        size_t batch_size) {
+
+  const int threads = 1024;
+  const int work_groups = (state_size + threads - 1) / threads;
+
+  // define the kernel
+  auto cgf = [&](sycl::handler& cgh) {
+    auto kfn = [=](sycl::nd_item<2> item) {
+
+      const int column = item.get_group(0) * item.get_group_range(0) + item.get_local_id(0);
+      const int index = item.get_group(1) * state_size + column;
+      const int gates_row = item.get_group(1) * (state_size * 3);
+
+      if (column < state_size) {
+        input_gate[index] = sigmoid(gates[gates_row + column]);
+        output_gate[index] = sigmoid(gates[gates_row + state_size + column]);
+        candidate_cell[index] = elu(gates[gates_row + 2 * state_size + column]);
+        new_cell[index] =
+                old_cell[index] + candidate_cell[index] * input_gate[index];
+        new_h[index] = tanh(new_cell[index]) * output_gate[index];
+      }
+
+    };
+
+    cgh.parallel_for(
+            sycl::nd_range<2>(
+                    sycl::range<2>(work_groups * threads, batch_size),
+                    sycl::range<2>(threads, 1)),
+            kfn);
+  };
+
+  // submit kernel
+  c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+  stream.queue().submit(cgf);
+}
+
+
+

We’re specifying that each work group has 1024 threads and that the entire GPU grid is split into as many work groups of 1 x 1024 threads as are required to fill our matrices with one thread per component. For example, if our state size was 2048 and our batch size 4, we’d launch a total of 4 x 2 = 8 work groups with 1024 threads each. If you are not familiar with the SYCL “work groups”, an introductory read about SYCL may help.
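As a quick sanity check of that arithmetic, the same launch geometry can be recomputed in a few lines of Python (the numbers mirror the example above):

# Launch geometry from the example above, recomputed as a sanity check.
state_size, batch_size, threads = 2048, 4, 1024
work_groups = (state_size + threads - 1) // threads      # 2 work groups per batch row
global_range = (work_groups * threads, batch_size)       # sycl::range<2>(2048, 4)
local_range = (threads, 1)                               # sycl::range<2>(1024, 1)
print(work_groups * batch_size)                          # 8 work groups in total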

+

Note that we must get the current c10::xpu::XPUStream of the current XPU device and use the XPU API to get the corresponding underlying sycl::queue. The kernel can then be submitted to that queue for execution.

+
+

Using accessors

+

You can see in the SYCL kernel that we work directly on pointers with the right type. Indeed, working directly with high level type agnostic tensors inside SYCL kernels would be very inefficient.

+

However, this comes at a cost of ease of use and readability, especially for highly dimensional data. We can use torch’s C++ accessor utilities to abstract access to high-dimensional data directly in the SYCL kernel.

+

The backward pass follows much the same pattern, but uses torch::PackedTensorAccessor32. You can get more information about these utilities in the PyTorch documentation:

+
template <typename scalar_t>
+void lltm_xpu_backward_kernel(
+        torch::PackedTensorAccessor32<scalar_t,2> d_old_cell,
+        torch::PackedTensorAccessor32<scalar_t,3> d_gates,
+        const torch::PackedTensorAccessor32<scalar_t,2> grad_h,
+        const torch::PackedTensorAccessor32<scalar_t,2> grad_cell,
+        const torch::PackedTensorAccessor32<scalar_t,2> new_cell,
+        const torch::PackedTensorAccessor32<scalar_t,2> input_gate,
+        const torch::PackedTensorAccessor32<scalar_t,2> output_gate,
+        const torch::PackedTensorAccessor32<scalar_t,2> candidate_cell,
+        const torch::PackedTensorAccessor32<scalar_t,3> gate_weights,
+        size_t state_size,
+        size_t batch_size) {
+
+  const int threads = 1024;
+  const int work_groups = (state_size + threads - 1) / threads;
+
+  // define the kernel
+  auto cgf = [&](sycl::handler& cgh) {
+    auto kfn = [=](sycl::nd_item<2> item) {
+      //batch index
+      const int n = item.get_group(1);
+      // column index
+      const int c = item.get_group(0) * item.get_group_range(0) + item.get_local_id(0);
+      auto d_gates_ = d_gates;
+      auto d_old_cell_ = d_old_cell;
+      if (c < d_gates.size(2)){
+        const auto d_output_gate = tanh(new_cell[n][c]) * grad_h[n][c];
+        const auto d_tanh_new_cell = output_gate[n][c] * grad_h[n][c];
+        const auto d_new_cell =
+                d_tanh(new_cell[n][c]) * d_tanh_new_cell + grad_cell[n][c];
+
+
+        d_old_cell_[n][c] = d_new_cell;
+        const auto d_candidate_cell = input_gate[n][c] * d_new_cell;
+        const auto d_input_gate = candidate_cell[n][c] * d_new_cell;
+
+        d_gates_[n][0][c] =
+                d_input_gate * d_sigmoid(gate_weights[n][0][c]);
+        d_gates_[n][1][c] =
+                d_output_gate * d_sigmoid(gate_weights[n][1][c]);
+        d_gates_[n][2][c] =
+                d_candidate_cell * d_elu(gate_weights[n][2][c]);
+      }
+    };
+
+    cgh.parallel_for(
+            sycl::nd_range<2>(
+                    sycl::range<2>(work_groups * threads, batch_size),
+                    sycl::range<2>(threads, 1)),
+            kfn);
+  };
+
+  // submit kernel
+  c10::xpu::XPUStream stream = c10::xpu::getCurrentXPUStream();
+  stream.queue().submit(cgf);
+}
+
+std::vector<torch::Tensor> lltm_xpu_backward(
+        torch::Tensor grad_h,
+        torch::Tensor grad_cell,
+        torch::Tensor new_cell,
+        torch::Tensor input_gate,
+        torch::Tensor output_gate,
+        torch::Tensor candidate_cell,
+        torch::Tensor X,
+        torch::Tensor gates,
+        torch::Tensor weights) {
+  auto d_old_cell = torch::zeros_like(new_cell);
+  auto d_gates = torch::zeros_like(gates);
+
+  const auto batch_size = new_cell.size(0);
+  const auto state_size = new_cell.size(1);
+
+  AT_DISPATCH_FLOATING_TYPES(X.type(), "lltm_backward_xpu", ([&] {
+    lltm_xpu_backward_kernel<scalar_t>(
+          d_old_cell.packed_accessor32<scalar_t,2>(),
+                  d_gates.packed_accessor32<scalar_t,3>(),
+                  grad_h.packed_accessor32<scalar_t,2>(),
+                  grad_cell.packed_accessor32<scalar_t,2>(),
+                  new_cell.packed_accessor32<scalar_t,2>(),
+                  input_gate.packed_accessor32<scalar_t,2>(),
+                  output_gate.packed_accessor32<scalar_t,2>(),
+                  candidate_cell.packed_accessor32<scalar_t,2>(),
+                  gates.packed_accessor32<scalar_t,3>(),
+                  state_size,
+                  batch_size);
+  }));
+
+  auto d_gate_weights = d_gates.reshape({batch_size, 3*state_size});
+  auto d_weights = d_gate_weights.t().mm(X);
+  auto d_bias = d_gate_weights.sum(/*dim=*/0, /*keepdim=*/true);
+
+  auto d_X = d_gate_weights.mm(weights);
+  auto d_old_h = d_X.slice(/*dim=*/1, 0, state_size);
+  auto d_input = d_X.slice(/*dim=*/1, state_size);
+
+  return {d_old_h, d_input, d_weights, d_bias, d_old_cell, d_gates};
+}
+
+
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/FSDP.html b/xpu/2.3.110+xpu/tutorials/features/FSDP.html new file mode 100644 index 000000000..5bbf4f206 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/FSDP.html @@ -0,0 +1,467 @@

Fully Sharded Data Parallel (FSDP)

+
+

Introduction

+

Fully Sharded Data Parallel (FSDP) is a PyTorch* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible. Please refer to the FSDP Tutorial for an introduction to FSDP.

+

To run FSDP on GPU, similar to DDP, we use Intel® oneCCL Bindings for Pytorch* (formerly known as torch-ccl) to implement the PyTorch c10d ProcessGroup API (https://github.com/intel/torch-ccl). It holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library* (oneCCL), a library for efficient distributed deep learning training that implements collectives such as AllGather, ReduceScatter, and others needed by FSDP. Refer to the oneCCL Github page for more information about oneCCL. To install Intel® oneCCL Bindings for Pytorch*, follow the same installation steps as for DDP.

+
+
+

FSDP Usage (GPU only)

+

FSDP is designed to align with PyTorch conventions. To use FSDP with Intel® Extension for PyTorch*, make the following modifications to your model script:

+
    +
  • Import the necessary packages.

  • +
+
import torch
+import intel_extension_for_pytorch 
+import oneccl_bindings_for_pytorch
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+
+
+
    +
  • Initialize the process group with ccl backend.

  • +
+
dist.init_process_group(backend='ccl')
+
+
+
    +
  • For FSDP with each process exclusively working on a single GPU, set the device ID as local rank.

  • +
+
torch.xpu.set_device("xpu:{}".format(rank))
+# or
+device = "xpu:{}".format(args.local_rank)
+torch.xpu.set_device(device)
+
+
+
    +
  • Wrap model by FSDP.

  • +
+
model = model.to(device)
+model = FSDP(model, device_id=device)
+
+
+

Note: for FSDP with XPU, you need to specify device_id with an XPU device; otherwise, it will trigger the CUDA path and throw an error.

+
+
+

Example

+

Here’s an example based on PyTorch FSDP Tutorial to illustrate the usage of FSDP on XPU and the necessary changes to switch from CUDA to an XPU case.

+
    +
  • Import necessary packages:

  • +
+
"""
+Import Intel® extension for Pytorch\* and Intel® oneCCL Bindings for Pytorch\*
+"""
+import os
+import argparse
+import functools
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+
+# Import Intel® extension for Pytorch\* and Intel® oneCCL Bindings for Pytorch\*
+import intel_extension_for_pytorch
+import oneccl_bindings_for_pytorch
+
+from torchvision import datasets, transforms
+from torch.optim.lr_scheduler import StepLR
+
+import torch.distributed as dist
+import torch.multiprocessing as mp
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.utils.data.distributed import DistributedSampler
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+from torch.distributed.fsdp.fully_sharded_data_parallel import (
+    CPUOffload,
+    BackwardPrefetch,
+)
+from torch.distributed.fsdp.wrap import (
+    size_based_auto_wrap_policy,
+    enable_wrap,
+    wrap,
+)
+
+
+
    +
  • Set up distributed training:

  • +
+
"""
+Initialize the process group with the Intel® oneCCL Bindings for Pytorch\* backend
+"""
+def setup(rank, world_size):
+    os.environ['MASTER_ADDR'] = 'localhost'
+    os.environ['MASTER_PORT'] = '12355'
+
+    # initialize the process group by Intel® oneCCL Bindings for Pytorch\*
+    dist.init_process_group("ccl", rank=rank, world_size=world_size)
+
+def cleanup():
+    dist.destroy_process_group()
+
+
+
    +
  • Define the toy model for handwritten digit classification:

  • +
+
class Net(nn.Module):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.conv1 = nn.Conv2d(1, 32, 3, 1)
+        self.conv2 = nn.Conv2d(32, 64, 3, 1)
+        self.dropout1 = nn.Dropout(0.25)
+        self.dropout2 = nn.Dropout(0.5)
+        self.fc1 = nn.Linear(9216, 128)
+        self.fc2 = nn.Linear(128, 10)
+
+    def forward(self, x):
+
+        x = self.conv1(x)
+        x = F.relu(x)
+        x = self.conv2(x)
+        x = F.relu(x)
+        x = F.max_pool2d(x, 2)
+        x = self.dropout1(x)
+        x = torch.flatten(x, 1)
+        x = self.fc1(x)
+        x = F.relu(x)
+        x = self.dropout2(x)
+        x = self.fc2(x)
+        output = F.log_softmax(x, dim=1)
+        return output
+
+
+
    +
  • Define a training function:

  • +
+
"""
+Change the device related logic from 'rank' to '"xpu:{}".format(rank)'
+"""
+def train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=None):
+    model.train()
+    # XPU device should be formatted as string, replace the rank with '"xpu:{}".format(rank)'
+    ddp_loss = torch.zeros(2).to("xpu:{}".format(rank))
+    if sampler:
+        sampler.set_epoch(epoch)
+    for batch_idx, (data, target) in enumerate(train_loader):
+        data, target = data.to("xpu:{}".format(rank)), target.to("xpu:{}".format(rank))
+        optimizer.zero_grad()
+        output = model(data)
+        loss = F.nll_loss(output, target, reduction='sum')
+        loss.backward()
+        optimizer.step()
+        ddp_loss[0] += loss.item()
+        ddp_loss[1] += len(data)
+
+    dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)
+    if rank == 0:
+        print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, ddp_loss[0] / ddp_loss[1]))
+
+
+
    +
  • Define a validation function:

  • +
+
"""
+Change the device related logic from 'rank' to '"xpu:{}".format(rank)'
+"""
+def test(model, rank, world_size, test_loader):
+    model.eval()
+    correct = 0
+    # XPU device should be formatted as string, replace the rank with '"xpu:{}".format(rank)'
+    ddp_loss = torch.zeros(3).to("xpu:{}".format(rank))
+    with torch.no_grad():
+        for data, target in test_loader:
+            data, target = data.to("xpu:{}".format(rank)), target.to("xpu:{}".format(rank))
+            output = model(data)
+            ddp_loss[0] += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
+            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
+            ddp_loss[1] += pred.eq(target.view_as(pred)).sum().item()
+            ddp_loss[2] += len(data)
+
+    dist.all_reduce(ddp_loss, op=dist.ReduceOp.SUM)
+
+    if rank == 0:
+        test_loss = ddp_loss[0] / ddp_loss[2]
+        print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
+            test_loss, int(ddp_loss[1]), int(ddp_loss[2]),
+            100. * ddp_loss[1] / ddp_loss[2]))
+
+
+
    +
  • Define a distributed training function that wraps the model in FSDP:

  • +
+
"""
+Change the device related logic from 'rank' to '"xpu:{}".format(rank)'.
+Specify the argument `device_id` as the XPU device ("xpu:{}".format(rank)) in the FSDP API.
+"""
+def fsdp_main(rank, world_size, args):
+    setup(rank, world_size)
+
+    transform=transforms.Compose([
+        transforms.ToTensor(),
+        transforms.Normalize((0.1307,), (0.3081,))
+    ])
+
+    dataset1 = datasets.MNIST('../data', train=True, download=True,
+                        transform=transform)
+    dataset2 = datasets.MNIST('../data', train=False,
+                        transform=transform)
+
+    sampler1 = DistributedSampler(dataset1, rank=rank, num_replicas=world_size, shuffle=True)
+    sampler2 = DistributedSampler(dataset2, rank=rank, num_replicas=world_size)
+
+    train_kwargs = {'batch_size': args.batch_size, 'sampler': sampler1}
+    test_kwargs = {'batch_size': args.test_batch_size, 'sampler': sampler2}
+    xpu_kwargs = {'num_workers': 2,
+                    'pin_memory': True,
+                    'shuffle': False}
+    train_kwargs.update(xpu_kwargs)
+    test_kwargs.update(xpu_kwargs)
+
+    train_loader = torch.utils.data.DataLoader(dataset1,**train_kwargs)
+    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)
+    my_auto_wrap_policy = functools.partial(
+        size_based_auto_wrap_policy, min_num_params=100
+    )
+    torch.xpu.set_device("xpu:{}".format(rank))
+
+
+    init_start_event = torch.xpu.Event(enable_timing=True)
+    init_end_event = torch.xpu.Event(enable_timing=True)
+
+    model = Net().to("xpu:{}".format(rank))
+    # Specify the argument `device_id` as the XPU device ("xpu:{}".format(rank)) in the FSDP API.
+    model = FSDP(model, device_id="xpu:{}".format(rank))
+
+    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
+
+    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
+    init_start_event.record()
+    for epoch in range(1, args.epochs + 1):
+        train(args, model, rank, world_size, train_loader, optimizer, epoch, sampler=sampler1)
+        test(model, rank, world_size, test_loader)
+        scheduler.step()
+
+    init_end_event.record()
+
+    if rank == 0:
+        print(f"XPU event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
+        print(f"{model}")
+
+    if args.save_model:
+        # use a barrier to make sure training is done on all ranks
+        dist.barrier()
+        states = model.state_dict()
+        if rank == 0:
+            torch.save(states, "mnist_cnn.pt")
+
+    cleanup()
+
+
+
    +
  • Finally, parse the arguments and set the main function:

  • +
+
"""
+Replace CUDA runtime API with XPU runtime API.
+"""
+if __name__ == '__main__':
+    # Training settings
+    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
+    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
+                        help='input batch size for training (default: 64)')
+    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
+                        help='input batch size for testing (default: 1000)')
+    parser.add_argument('--epochs', type=int, default=10, metavar='N',
+                        help='number of epochs to train (default: 10)')
+    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
+                        help='learning rate (default: 1.0)')
+    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
+                        help='Learning rate step gamma (default: 0.7)')
+    parser.add_argument('--no-cuda', action='store_true', default=False,
+                        help='disables CUDA training')
+    parser.add_argument('--seed', type=int, default=1, metavar='S',
+                        help='random seed (default: 1)')
+    parser.add_argument('--save-model', action='store_true', default=False,
+                        help='For Saving the current Model')
+    args = parser.parse_args()
+
+    torch.manual_seed(args.seed)
+
+    WORLD_SIZE = torch.xpu.device_count()
+    mp.spawn(fsdp_main,
+        args=(WORLD_SIZE, args),
+        nprocs=WORLD_SIZE,
+        join=True)
+
+
+
    +
  • Put the above code snippets into a Python script FSDP_mnist_xpu.py, and run:

  • +
+
python FSDP_mnist_xpu.py
+
+
+
+
diff --git a/xpu/2.3.110+xpu/tutorials/features/advanced_configuration.html b/xpu/2.3.110+xpu/tutorials/features/advanced_configuration.html new file mode 100644 index 000000000..7b54a0122 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/advanced_configuration.html @@ -0,0 +1,415 @@

Advanced Configuration

+

The default settings for Intel® Extension for PyTorch* are sufficient for most use cases. However, if users want to customize Intel® Extension for PyTorch*, advanced configuration is available at build time and runtime.

+
+

Build Time Configuration

+

The following build options are supported by Intel® Extension for PyTorch*. Users who install Intel® Extension for PyTorch* via source compilation can override the default configuration by explicitly setting a build option to ON or OFF, and then building.

Build Option           | Default Value | Description
USE_ONEMKL             | ON            | Use oneMKL BLAS
USE_CHANNELS_LAST_1D   | ON            | Use channels last 1d
USE_PERSIST_STREAM     | ON            | Use persistent oneDNN stream
USE_SCRATCHPAD_MODE    | ON            | Use oneDNN scratchpad mode
USE_PRIMITIVE_CACHE    | ON            | Cache oneDNN primitives by FRAMEWORK for specific operators
USE_QUEUE_BARRIER      | ON            | Use queue submit_barrier, otherwise use dummy kernel
USE_PTI                | ON            | Build XPU Profiler with PTI support
USE_DS_KERNELS         | ON            | Build DeepSpeed kernels
USE_SYCL_ASSERT        | OFF           | Enable assert in SYCL kernel
USE_ITT_ANNOTATION     | OFF           | Enable ITT annotation in SYCL kernel
USE_SPLIT_FP64_LOOPS   | ON            | Split FP64 loops into separate kernels for element-wise kernels
BUILD_BY_PER_KERNEL    | OFF           | Build by DPC++ per_kernel option (exclusive with USE_AOT_DEVLIST)
BUILD_INTERNAL_DEBUG   | OFF           | Use internal debug code path
BUILD_SEPARATE_OPS     | OFF           | Build each operator in a separate library
BUILD_SIMPLE_TRACE     | ON            | Build simple trace for each registered operator
USE_AOT_DEVLIST        | ""            | Set device list for AOT build
USE_XETLA              | "ON"          | Use XeTLA-based custom kernels; specify a comma-separated list of GPU architectures (e.g. xe_lpg,xe_hpg) to enable kernels only for specific platforms
USE_ONEDNN_DIR         | ""            | Specify the oneDNN source path which contains its include directory and lib directory
USE_XETLA_SRC          | "${IPEX_GPU_ROOT_DIR}/aten/operators/xetla/kernels/" | Specify the XeTLA source path which contains its include directory
BUILD_OPT_LEVEL        | ""            | Add build option -Ox, accepted values: 0/1
BUILD_WITH_SANITIZER   | ""            | Build with sanitizer check; supports one of address, thread, or leak at a time. The default option is address.

For the build options above that can be configured to ON or OFF, users can also set them to 1 or 0, where ON equals 1 and OFF equals 0.

+
+
+

Runtime Configuration

+

The following launch options are supported in Intel® Extension for PyTorch*. Users who execute AI models on XPU can override the default configuration by explicitly setting an option value at runtime using environment variables before launching the execution.

Launch Option (CPU, GPU)      | Default Value | Description
IPEX_FP32_MATH_MODE           | FP32          | Set values for FP32 math mode (valid values: FP32, TF32, BF32). Refer to API Documentation for details.

Launch Option (GPU only)      | Default Value | Description
IPEX_VERBOSE                  | 0             | Set verbose level with synchronization execution mode; will be deprecated very soon. Please use IPEX_LOG_LEVEL instead.
IPEX_XPU_SYNC_MODE            | 0             | Set 1 to enforce synchronization execution mode; will be deprecated very soon.
IPEX_LOG_LEVEL                | -1            | Set the log level to trace the execution and get log information; please refer to ipex_log.md for the different log levels.
IPEX_LOG_COMPONENT            | "ALL"         | Set IPEX_LOG_COMPONENT = ALL to log messages from all components. Use ';' as a separator to log more than one component, such as "OPS;RUNTIME". Use '/' as a separator to log subcomponents.
IPEX_LOG_ROTATE_SIZE          | -1            | Set the rotate file size in MB for IPEX_LOG; a value less than 0 disables this setting.
IPEX_LOG_SPLIT_SIZE           | -1            | Set the split file size in MB for IPEX_LOG; a value less than 0 disables this setting.
IPEX_LOG_OUTPUT               | ""            | Set the output file path for IPEX_LOG; the default is null.

Launch Option (Experimental)  | Default Value | Description
IPEX_SIMPLE_TRACE             | 0             | Set 1 to enable simple trace for all operators; will be deprecated very soon. Please use IPEX_LOG_LEVEL instead.

Distributed Option (GPU only) | Default Value | Description
TORCH_LLM_ALLREDUCE           | 0             | This is a prototype feature to provide better scale-up performance by enabling optimized collective algorithms in oneCCL and asynchronous execution in torch-ccl. This feature requires XeLink enabled for cross-card communication. By default (0), this feature is not enabled.
CCL_BLOCKING_WAIT             | 0             | This is a prototype feature to control whether collectives execution on XPU is host blocking or non-blocking. By default, setting 0 enables blocking behavior.
CCL_SAME_STREAM               | 0             | This is a prototype feature to allow using a computation stream as the communication stream to minimize overhead for stream synchronization. By default, setting 0 uses separate streams for communication.

For the launch options above that can be configured to 1 or 0, users can also set them to ON or OFF, where ON equals 1 and OFF equals 0.

+

Examples to configure the launch options:

+
    +
  • Set one or more options before running the model

  • +
+
export IPEX_LOG_LEVEL=1
+export IPEX_FP32_MATH_MODE=TF32
+...
+python ResNet50.py
+
+
+
    +
  • Set one option when running the model

  • +
+
IPEX_LOG_LEVEL=1 python ResNet50.py
+
+
+
    +
  • Set more than one option when running the model

  • +
+
IPEX_LOG_LEVEL=1 IPEX_FP32_MATH_MODE=TF32 python ResNet50.py
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/amp_gpu.html b/xpu/2.3.110+xpu/tutorials/features/amp_gpu.html new file mode 100644 index 000000000..e0a0a4224 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/amp_gpu.html @@ -0,0 +1,279 @@

Auto Mixed Precision (AMP) on GPU

+
+

Introduction

+

torch.xpu.amp provides convenient automatic data type conversion at runtime. Deep learning workloads can benefit from lower-precision floating point data types such as torch.float16 or torch.bfloat16 because of their lighter computational workload and smaller memory usage. However, accuracy is sacrificed when using lower-precision floating point data types, so there is a trade-off between accuracy and performance. Thus, some operations should use the slower but more accurate torch.float32, while others can be converted to use the faster but less accurate torch.float16 or torch.bfloat16 data type. The Auto Mixed Precision (AMP) feature automates the tuning of data type conversions over all operators.

+

Inference workloads using torch.xpu.amp support torch.bfloat16 and torch.float16. Training workloads using torch.xpu.amp support torch.bfloat16. torch.bfloat16 is the default lower precision floating point data type when torch.xpu.amp is enabled.

+
+
+

Use Case

+

The following simple network should show a speedup with mixed precision.

+
class SimpleNet(torch.nn.Module):
+    def __init__(self):
+        super(SimpleNet, self).__init__()
+        self.conv = torch.nn.Conv2d(64, 128, (3, 3), stride=(2, 2), padding=(1, 1), bias=False)
+
+    def forward(self, x):
+        return self.conv(x)
+
+
+
+

Default Precision

+

Without torch.xpu.amp, the network executes all operators with default precision (torch.float32).

+
model = SimpleNet().to("xpu")
+x = torch.rand(64, 64, 224, 224).to("xpu")
+y = model(x)
+
+
+
+
+

Inference with Imperative Path

+

torch.xpu.amp.autocast is designed to be a context manager that allows scopes of your script to run with mixed precision. In these scopes, operations run in a data type chosen by the autocast class to improve performance while maintaining accuracy. See the Op-Specific Behavior section below for details on what precision the autocast class chooses for each operator, and under what circumstances.

+
model = SimpleNet().to("xpu").eval()
+x = torch.rand(64, 64, 224, 224).to("xpu")
+with torch.xpu.amp.autocast(dtype=torch.float16):
+    y = model(x)
+
+
+
+
+

Inference with TorchScript Path

+

torch.xpu.amp.autocast can be used with torch.jit.trace to apply graph optimization. Due to a PyTorch limitation, only torch.jit.trace is supported.

+
model = SimpleNet().to("xpu").eval()
+x = torch.rand(64, 64, 224, 224).to("xpu")
+with torch.xpu.amp.autocast(dtype=torch.float16):
+    model = torch.jit.trace(model, x)
+    model = torch.jit.freeze(model)
+    y = model(x)
+
+
+
+
+

Training Support

+

torch.xpu.amp.autocast can be used in training to improve performance.

+
model = SimpleNet().to("xpu")
+optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
+for images, label in train_loader():
+    with torch.xpu.amp.autocast():
+        loss = criterion(model(images.to("xpu")), label.to("xpu"))
+    loss.backward()
+    optimizer.step()
+
+
+
+
+
+

Autocast Op Reference

+
+

Op Eligibility

+

Ops that run in float64 or non-floating-point dtypes are not eligible for mixed precision, and will run in these types whether or not autocast is enabled.

+

Only out-of-place ops and Tensor methods are eligible for mixed precision. In-place variants and calls that explicitly supply an out=... Tensor are allowed in autocast-enabled regions, but won’t go through autocasting. For example, in an autocast-enabled region a.addmm(b, c) can autocast, but a.addmm_(b, c) and a.addmm(b, c, out=d) cannot. For best performance and stability, use out-of-place ops in autocast-enabled regions.
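Below is a small sketch of this rule on the XPU device; the shapes are illustrative, and addmm is used because it appears in the float16 autocast list later in this section.

import torch
import intel_extension_for_pytorch  # noqa: F401

a = torch.randn(4, 4, device="xpu")
b = torch.randn(4, 4, device="xpu")
c = torch.randn(4, 4, device="xpu")
d = torch.empty(4, 4, device="xpu")

with torch.xpu.amp.autocast(dtype=torch.float16):
    out = a.addmm(b, c)   # out-of-place: autocasts, runs in float16
    a.addmm_(b, c)        # in-place variant: allowed, but not autocast (stays float32)
    a.addmm(b, c, out=d)  # explicit out=: allowed, but not autocast (d stays float32)

print(out.dtype)  # torch.float16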

+
+
+

Op-Specific Behavior

+

The following lists describe the behavior of eligible ops in autocast-enabled regions. These ops always go through autocasting whether they are invoked as part of a torch.nn.Module, as a function, or as a torch.Tensor method. If functions are exposed in multiple namespaces, they go through autocasting regardless of the namespace.

+

Ops not listed below do not go through autocasting. They run in the type defined by their inputs. However, autocasting may still change the type in which unlisted ops run if they’re downstream from autocasted ops.

+

If an op is unlisted, we assume it’s numerically stable in bfloat16 or float16. If you believe that an unlisted op is numerically unstable in bfloat16 or float16, file a GitHub issue.

+
+

Ops that can autocast to bfloat16

+

conv1d, conv2d, conv3d, _convolution, convolution, conv_tbc, conv_transpose1d, conv_transpose2d, conv_transpose3d, prelu, addmm, addmv, addr, linear, matmul, mm, mv, bmm, baddbmm, addbmm, chain_matmul, linalg_multi_dot, _thnn_fused_gru_cell, gru_cell, scaled_dot_product_attention

+
+
+

Ops that can autocast to float16

+

conv1d, conv2d, conv3d, _convolution, convolution, conv_tbc, conv_transpose1d, conv_transpose2d, conv_transpose3d, prelu, addmm, addmv, addr, linear, matmul, mm, mv, bmm, baddbmm, addbmm, chain_matmul, linalg_multi_dot, _thnn_fused_gru_cell, gru_cell, scaled_dot_product_attention

+
+
+

Ops that can autocast to float32

+

binary_cross_entropy, binary_cross_entropy_with_logits, log_softmax, nll_loss, nll_loss2d, nll_loss_nd, cross_entropy_loss, fft_fft, fft_ifft, fft_fft2, fft_ifft2, fft_fftn, fft_ifftn, fft_rfft, fft_irfft, fft_rfft2, fft_irfft2, fft_rfftn, fft_irfftn, fft_hfft, fft_ihfft, reciprocal, pow, frobenius_norm, nuclear_norm, cosine_similarity, poisson_nll_loss, cosine_embedding_loss, hinge_embedding_loss, kl_div, l1_loss, smooth_l1_loss , huber_loss, mse_loss, margin_ranking_loss, multilabel_margin_loss, soft_margin_loss, triplet_margin_loss, multi_margin_loss, dist, pdist, cdist, renorm

+
+
+

Ops that promote to the widest input type

+

These ops don’t require a particular dtype for stability, but take multiple inputs and require that the inputs’ dtypes match. If all of the inputs are bfloat16, the op runs in bfloat16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

+

cat, stack, addcdiv, addcmul, atan2, bilinear, cross, dot, grid_sampler, index_put, tensordot, scatter_add
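As an illustration of this promotion rule, here is a minimal sketch with torch.cat on XPU (bfloat16 is the default lower-precision type, and the shapes are illustrative):

import torch
import intel_extension_for_pytorch  # noqa: F401

x_bf16 = torch.randn(2, 2, device="xpu", dtype=torch.bfloat16)
x_fp32 = torch.randn(2, 2, device="xpu", dtype=torch.float32)

with torch.xpu.amp.autocast():
    same = torch.cat([x_bf16, x_bf16])   # all inputs bfloat16: runs in bfloat16
    mixed = torch.cat([x_bf16, x_fp32])  # mixed inputs: promoted, runs in float32

print(same.dtype, mixed.dtype)  # torch.bfloat16 torch.float32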

+

Some ops not listed here (e.g., binary ops such as add) natively promote inputs without autocasting’s intervention. If inputs are a mixture of bfloat16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.

+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/auto_channels_last.html b/xpu/2.3.110+xpu/tutorials/features/auto_channels_last.html new file mode 100644 index 000000000..da4e801eb --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/auto_channels_last.html @@ -0,0 +1,206 @@
+

Auto Channels Last

+

Channels last memory format is known to have a performance advantage over channels first memory format. Refer to Channels Last for details. Intel® Extension for PyTorch* automatically converts the model to channels last memory format by default when users optimize their model with ipex.optimize(model).

+
+

Ease-of-use auto channels last API

+

Note: Auto channels last APIs ipex.enable_auto_channels_last() and ipex.disable_auto_channels_last() will be deprecated in future releases.

+
+

default

+
model = ipex.optimize(model) # by default, model is channels last
+
+
+
+
+

enable

+
ipex.enable_auto_channels_last() # This API will be deprecated in future releases.
+model = ipex.optimize(model) # enable, model is channels last
+
+
+
+
+

disable

+
ipex.disable_auto_channels_last() # This API will be deprecated in future releases.
+model = ipex.optimize(model) # disable, model is channels first 
+
+
+
+
+
+

Known issue

+

For a broad range of models, channels last memory format brings a performance boost over channels first memory format. However, for a few use cases, it may cause a performance regression. If a performance regression is observed, we recommend feeding sample input data to ipex.optimize(model, sample_input=...).

+
model = ipex.optimize(model, sample_input=...)
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/compute_engine.html b/xpu/2.3.110+xpu/tutorials/features/compute_engine.html new file mode 100644 index 000000000..c4192d3c1 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/compute_engine.html @@ -0,0 +1,217 @@
+

Compute Engine (Experimental feature for debug)

+
+

Introduction

+

The compute engine feature provides the capability to choose a specific backend for operators with multiple implementations. For example, with the compute engine set, we can prefer the SYCL implementation over the oneDNN implementation for concatenation. This feature helps users customize model forward behavior for better performance or special requirements.

+

We currently support five compute engines: RECOMMEND, BASIC, ONEDNN, ONEMKL, and XETLA. Each op with multiple implementations has a recommended one based on our empirical experience. The RECOMMEND engine ideally guarantees good performance for most input shapes. The BASIC engine refers to the SYCL implementation. ONEDNN, ONEMKL, and XETLA refer to the optimized implementations provided by the Intel® oneAPI Deep Neural Network Library (oneDNN), the Intel® oneAPI Math Kernel Library (oneMKL), and Intel® Xe Templates for Linear Algebra (XeTLA), respectively.

+
+
+

Use Case

+

Code snippet below demonstrates the usage of compute engine feature to select oneDNN as the compute engine of operator torch.cat.

+
with torch.xpu.compute_eng(torch.xpu.XPUComputeEng.ONEDNN):
+    x1 = torch.randn((1, 3, 20, 20), device="xpu")
+    x2 = torch.randn((1, 5, 20, 20), device="xpu")
+    torch.cat([x1, x2], dim=1)
+
+
+
+
+

Engine Selection Policy

+

Generally, the priority for choosing an engine follows the order: operator-specific argument > onednn_layout format input > user-set engine > recommended engine. Check the following for details:

+

Step 1: In some cases, operators with specific arguments may not have implementations for all compute engines. For these operators, the implemented compute engines have the highest priority in the selection process. For example, operator torch.nn.Upsample with argument align_corners=True has only SYCL implementation for GPU. Thus, the BASIC engine, referring to SYCL implementations, is always its computing engine.

+

Step 2: If there is no special argument and the inputs contain an ONEDNN_LAYOUT tensor, the ONEDNN engine is chosen if possible. This utilizes the highly optimized code in the oneDNN library to speed up computation. If oneDNN has no support for the operator, the engine selection process continues to the next step.

+

Step 3: If the user manually set an engine, that engine is chosen as long as the operator supports this implementation.

+

Step 4: If the compute engine designated by the user is not implemented or available, execution of the operator falls back to the RECOMMEND engine.

+

(Figure fig-2(1)-pt-conv-layout-path-dispatch: engine selection dispatch flow)
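The sketch below illustrates Step 1 and Step 3; it assumes that torch.xpu.XPUComputeEng exposes members named after the engines listed above (only ONEDNN is shown explicitly elsewhere in this document), so treat it as illustrative rather than authoritative.

import torch
import intel_extension_for_pytorch  # noqa: F401

x = torch.randn(1, 3, 16, 16, device="xpu")

# Step 1: Upsample with align_corners=True only has a SYCL (BASIC) implementation,
# so BASIC is used regardless of the engine requested here.
up = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
with torch.xpu.compute_eng(torch.xpu.XPUComputeEng.ONEDNN):
    y = up(x)

# Step 3: the engine set by the user is honored when the operator supports it,
# e.g. preferring the SYCL implementation of concatenation.
with torch.xpu.compute_eng(torch.xpu.XPUComputeEng.BASIC):
    z = torch.cat([x, x], dim=1)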

+
+
+

Multiple Implementations Operators and Engines

+

AveragePool2d: ONEDNN, BASIC [Recommend]

+

Concat: ONEDNN, BASIC [Recommend]

+

MaxPool2d, MaxPool3d: ONEDNN, BASIC [Recommend]

+

LSTM: ONEDNN, BASIC [Recommend]

+
`BASIC` is recommended currently. When the optimizations in oneDNN are finished, `ONEDNN` will become the recommended engine.
+
+
+

LayerNorm: ONEDNN, BASIC [Recommend]

+

PermuteContiguous: ONEDNN, BASIC [Recommend]

+

SoftMax: ONEDNN, BASIC [Recommend]

+
The `BASIC` engine is always chosen if the input tensor has more than 3 dimensions or its `dtype` is other than `fp16`, `fp32` or `bfloat16`.
+
+
+

UpsampleBilinear2d: ONEDNN, BASIC [Recommend]

+
The `BASIC` engine is always chosen if argument `align_corners=True`.
+
+
+

UpsampleNearest: ONEDNN, BASIC [Recommend]

+
The `ONEDNN` engine is always chosen if the output shape is divisible by the input shape.
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/deepspeed_kernels.html b/xpu/2.3.110+xpu/tutorials/features/deepspeed_kernels.html new file mode 100644 index 000000000..b8932020b --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/deepspeed_kernels.html @@ -0,0 +1,153 @@
+

Intel® Extension for PyTorch* - DeepSpeed* Kernels

+

(intel_extension_for_pytorch.deepspeed module)

+
+

Introduction

+

DeepSpeed* creates custom kernels for its feature support and performance optimizations. The DeepSpeed custom kernels for the Intel XPU device are integrated into Intel® Extension for PyTorch* under the ecological library category. It is worth noting that the kernels are designed specifically for DeepSpeed*; therefore, they are NOT necessarily general-purpose or validated when used in scenarios other than DeepSpeed*.

+

The DeepSpeed* kernels module provides below custom kernels for DeepSpeed*:

+
    +
  • quantization: including quantize/dequantize with fp32/fp16, etc

  • +
  • transformer inference: including the bias GeGLU, layernorm, layernorm + residual, layernorm + store pre layernorm residual, RMS norm, pre RMS norm, vector add, MLP with fp16, MoE residual matmul, reset cache, release/retake workspace etc.

  • +
+
+
+

Supported Platform

+

This module supports xpu device on Intel® Data Center GPU Max Series only.

+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/float8.html b/xpu/2.3.110+xpu/tutorials/features/float8.html new file mode 100644 index 000000000..001ee7071 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/float8.html @@ -0,0 +1,210 @@
+

Float8 Data Type Support (Prototype)

+
+

Float8 Data Type

+

Float8 (FP8) is an 8-bit floating point data type, which is used to reduce memory footprint, improve computation efficiency, and save power in the Deep Learning domain.

+

Two formats are used in FP8 training and inference in order to meet the required value range and precision of activations, weights, and gradients in a Deep Neural Network (DNN). One is E4M3 (sign-exponent-mantissa) for activations and weights; the other is E5M2 for gradients. These two formats are defined in FP8 FORMATS FOR DEEP LEARNING.
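As a small illustration of the range/precision trade-off between the two formats, the stock PyTorch fp8 dtypes (torch.float8_e4m3fn and torch.float8_e5m2) can be used to round a few values; this is only a sketch of the number formats themselves, not of the FP8 quantization flow described below.

import torch

x = torch.tensor([0.1234, 1.7, 300.0])
print(x.to(torch.float8_e4m3fn).float())  # E4M3: more mantissa bits, finer precision
print(x.to(torch.float8_e5m2).float())    # E5M2: more exponent bits, wider range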

+
+
+

FP8 Quantization

+

On GPU, online Dynamic Quantization is used for FP8 data compression and decompression. The Delayed Scaling algorithm is used to accelerate the quantization process.

+
+
+

Supported running mode

+

Both DNN Training and Inference are supported with the FP8 data type.

+
+
+

Supported operators

+

FP8 Linear operator is supported.

+
+
+

FP8 usage example

+

The BERT model is supported as an FP8 training showcase; see the following example:

+
from intel_extension_for_pytorch.quantization.fp8 import (
+    fp8_autocast,
+    DelayedScaling,
+    Format,
+    FP8Linear,
+)
+
+## Convert the original model to a new model composed of FP8 operators.
+fp8_model = prepare_fp8(model)
+## Run FP8 model.
+with fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
+    outputs = fp8_model(input_ids=input_ids,
+                    token_type_ids=segment_ids,
+                    attention_mask=input_mask,
+                    labels=masked_lm_labels,
+                    next_sentence_label=next_sentence_labels)
+
+
+
+
\ No newline at end of file
diff --git a/xpu/2.3.110+xpu/tutorials/features/horovod.html b/xpu/2.3.110+xpu/tutorials/features/horovod.html new file mode 100644 index 000000000..97ecf8b75 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/horovod.html @@ -0,0 +1,265 @@
+

Horovod with PyTorch (Prototype)

+

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, you need to install Horovod with PyTorch support first, and then make specific changes for Horovod in your training script.

+
+

Install Horovod with PyTorch

+

You can use a normal pip command to install Intel® Optimization for Horovod*:

+
python -m pip install intel-optimization-for-horovod
+
+
+

Note: Make sure you have already installed the oneAPI Base Toolkit. You need to activate its environment when using Horovod.

+
source ${HOME}/intel/oneapi/ccl/latest/env/vars.sh
+
+
+
+
+

Horovod with PyTorch Usage

+

To use Horovod with PyTorch for XPU backend, make the following modifications to your training script:

+
    +
  1. Initialize Horovod.

    +
     import torch
    + import intel_extension_for_pytorch
    + import horovod.torch as hvd
    + hvd.init()
    +
    +
    +
  2. Pin each GPU to a single process.

    +

    With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

    +
     devid = hvd.local_rank()
    + torch.xpu.set_device(devid)
    +
    +
    +
  3. Scale the learning rate by the number of workers.

    +

    Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size (a short sketch follows this list).

    +
  4. Wrap the optimizer in hvd.DistributedOptimizer.

    +

    The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

    +
      5. Broadcast the initial variable states from rank 0 to all other processes:
    

    +
     hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    + hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    +
    +
    +

    This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

    +
      6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them (see the sketch after the full example below).
    

    +

        Accomplish this by guarding the model checkpointing code so that it runs only when hvd.rank() == 0.
    

    +
    
+

Example:

+
import torch
+import intel_extension_for_pytorch
    +import horovod.torch as hvd
    +import torch.optim as optim
    +import torch.nn.functional as F
    
+
+# Initialize Horovod
+hvd.init()
+
+# Pin GPU to be used to process local rank (one GPU per process)
+devid = hvd.local_rank()
+torch.xpu.set_device(devid)
+device = "xpu:{}".format(devid)
+
+# Define dataset...
+train_dataset = ...
+
+# Partition dataset among workers using DistributedSampler
+train_sampler = torch.utils.data.distributed.DistributedSampler(
+    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
+
+train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)
+
+# Build model...
+model = ...
+model.to(device)
+
    +optimizer = optim.SGD(model.parameters(), lr=0.01)  # placeholder base learning rate
    
+
+# Add Horovod Distributed Optimizer
+optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
+
+# Broadcast parameters from rank 0 to all other processes.
+hvd.broadcast_parameters(model.state_dict(), root_rank=0)
+
+for epoch in range(100):
+   for batch_idx, (data, target) in enumerate(train_loader):
+       optimizer.zero_grad()
+       output = model(data)
+       loss = F.nll_loss(output, target)
+       loss.backward()
+       optimizer.step()
+       if batch_idx % args.log_interval == 0:
+           print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
+               epoch, batch_idx * len(data), len(train_sampler), loss.item()))
+
+
+
+
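    As a minimal sketch of steps 3 and 6 above, the snippet below scales the learning rate by the number of workers and saves a checkpoint on rank 0 only. The base learning rate, checkpoint contents, and file name are illustrative assumptions, not part of the original example.

    # Sketch for steps 3 and 6: scale the learning rate by hvd.size() and
    # save checkpoints on rank 0 only. Values and file name are placeholders.
    base_lr = 0.01
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

    if hvd.rank() == 0:
        torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()},
                   "checkpoint_rank0.pt")
    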
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/int8_overview_xpu.html b/xpu/2.3.110+xpu/tutorials/features/int8_overview_xpu.html new file mode 100644 index 000000000..d31869bf3 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/int8_overview_xpu.html @@ -0,0 +1,257 @@ + + + + + + + Intel® Extension for PyTorch* Optimizations for Quantization [GPU] — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Intel® Extension for PyTorch* Optimizations for Quantization [GPU]

+

Intel® Extension for PyTorch* currently supports imperative mode and TorchScript mode for post-training static quantization on GPU. This section illustrates the quantization workflow on Intel GPUs.

+

    Overall, our usage follows the API defined in official PyTorch, so only small modifications, such as moving the model and data to GPU with to('xpu'), are required. We highly recommend using TorchScript for quantizing models. With a graph model created via TorchScript, optimizations like operator fusion (e.g. conv_relu) are enabled automatically, which delivers the best performance for int8 workloads.
    

+
+

Imperative Mode

+
import torch
+import intel_extension_for_pytorch
+
+# Define model
+model = Model().to("xpu")
+model.eval()
+modelImpe = torch.quantization.QuantWrapper(model)
+
+# Define QConfig
    +qconfig = torch.quantization.QConfig(activation=torch.quantization.observer.MinMaxObserver.with_args(qscheme=torch.per_tensor_symmetric),
    +    weight=torch.quantization.default_weight_observer)  # weight could also be per-channel
    
+
+modelImpe.qconfig = qconfig
+
+# Prepare model for inserting observer
+torch.quantization.prepare(modelImpe, inplace=True)
+
+# Calibration to obtain statistics for Observer
+for data in calib_dataset:
+    modelImpe(data)
+
+# Convert model to create a quantized module
+torch.quantization.convert(modelImpe, inplace=True)
+
+# Inference
+modelImpe(inference_data)
+
+
+

    Imperative mode usage follows official PyTorch; more details can be found in the PyTorch documentation.
    

+

    Defining the quantization config (QConfig) for the model is the first stage of quantization. Per-tensor quantization is supported for activation quantization, while both per-tensor and per-channel quantization are supported for weights. Weights can be quantized to the int8 data type only. For activation quantization, both symmetric and asymmetric schemes are supported, and both uint8 and int8 data types are supported.
    

+

    If the best performance is desired, we recommend using the symmetric + int8 combination. Other configurations may have lower performance due to the existence of a zero_point.
    

+
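    For illustration, a QConfig that follows this recommendation (symmetric int8 activations together with a per-channel symmetric int8 weight observer) could be sketched as below; the observer choices are an assumption for demonstration rather than the only valid combination.

    import torch

    # Sketch: symmetric int8 activation observer plus per-channel symmetric
    # int8 weight observer, matching the recommendation above.
    qconfig_sym_int8 = torch.quantization.QConfig(
        activation=torch.quantization.observer.MinMaxObserver.with_args(
            qscheme=torch.per_tensor_symmetric, dtype=torch.qint8),
        weight=torch.quantization.observer.PerChannelMinMaxObserver.with_args(
            qscheme=torch.per_channel_symmetric, dtype=torch.qint8))
    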

    After defining a QConfig, the prepare function is used to insert observers into the model. The observers are responsible for collecting statistics for quantization, so a calibration stage is needed for them to collect this information.
    

+

    After calibration, the convert function quantizes the weights in the module and swaps the FP32 modules with quantized ones, creating an int8 module. Feel free to use it for inference.
    

+
+
+

TorchScript Mode

+
import torch
+import intel_extension_for_pytorch
+from torch.quantization.quantize_jit import (
+    convert_jit,
+    prepare_jit,
+)
+
+# Define model
+model = Model().to("xpu")
+model.eval()
+
+# Generate a ScriptModule
+modelJit = torch.jit.trace(model, example_input) # or torch.jit.script(model)
+
    +# Define QConfig
    
+qconfig = torch.quantization.QConfig(
+    activation=torch.quantization.observer.MinMaxObserver.with_args(
+        qscheme=qscheme,
+        reduce_range=False,
+        dtype=dtype
+    ),
+    weight=torch.quantization.default_weight_observer
+)
+
+# Prepare model for inserting observer
+modelJit = prepare_jit(modelJit, {'': qconfig}, inplace=True)
+
+# Calibration 
+for data in calib_dataset:
+    modelJit(data)
+
+# Convert model to quantized one
+modelJit = convert_jit(modelJit)
+
+# Warmup to fully trigger fusion patterns
+for i in range(5):
+    modelJit(warmup_data) 
+# Inference
+modelJit(inference_data)
+
+# Debug
    +print(modelJit.graph_for(inference_data))
    
+
+
+

    We need to define a QConfig for the TorchScript module, use prepare_jit for inserting observers, and use convert_jit for replacing FP32 modules.
    

+

    Before prepare_jit, create a ScriptModule using torch.jit.script or torch.jit.trace. jit.trace is recommended because it is capable of capturing the whole graph in most scenarios.
    

+

    Fusion operations like conv_unary, conv_binary, linear_unary (e.g. conv_relu, conv_sum_relu) are automatically enabled after model conversion (convert_jit). A warmup stage is required to bring the fusion into effect. With the benefit of fusion, a ScriptModule can deliver better performance than eager mode. Hence, we recommend using ScriptModule for performance reasons.
    

+

modelJit.graph_for(input) is useful to dump the inference graph and other graph related information for performance analysis.

+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/ipex_log.html b/xpu/2.3.110+xpu/tutorials/features/ipex_log.html new file mode 100644 index 000000000..675d7eda2 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/ipex_log.html @@ -0,0 +1,328 @@ + + + + + + + IPEX_LOG (Prototype) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

IPEX_LOG (Prototype)

+
+

Introduction

+

    IPEX_LOG provides the capability to log verbose information from Intel® Extension for PyTorch*. Use IPEX_LOG to get log information or trace the execution of Intel® Extension for PyTorch*. Please continue using PyTorch* macros such as TORCH_CHECK, TORCH_ERROR, etc. to get log information from PyTorch*.
    

+
+
+

IPEX_LOG Definition

+
+

Log Level

+

    The supported log levels are defined as follows; the default log level is DISABLED:
    

    Log level    Number    Usage
    DISABLED     -1        Disable the logging
    TRACE        0         Reserved for further usage
    DEBUG        1         Provide the whole calling stack info
    INFO         2         Record calling info to other library functions and environment variable settings
    WARN         3         Warn on the second attempt of an action, such as memory reallocation
    ERR          4         Report errors caught in try/catch
    CRITICAL     5         Reserved for further usage
    
+
+

Log Component

+

    The log component is used to specify which part of Intel® Extension for PyTorch* this log information belongs to. The supported log components are defined as follows:
    

    Log component    Description
    OPS              Launch SYCL, oneDNN, oneMKL operators
    SYNGRAPH         Syngraph related
    MEMORY           Allocate/Free memory, Allocate/Free cache
    RUNTIME          Device / Queue related
    ALL              All output log
    
+
+
+

Usage in C++

+

All the usage are defined in utils/LogUtils.h. Currently Intel® Extension for PyTorch* supports:

+
+

Simple Log

+

    You can use IPEX_XXX_LOG, where XXX represents the log level mentioned above. There are four parameters defined for the simple log:
    

+
      • Log component, representing which part of Intel® Extension for PyTorch* this log belongs to.

      • Log sub component; input an empty string ("") for general usage. For SYNGRAPH you can add any log sub component.

      • Log message template format string, the same as fmt_string in the fmt library; {} is used as a placeholder for the format args.

      • Log args for the template format string; the number of args should match the number of {} placeholders.
    
+

Below is an example for using simple log inside abs kernel:

+
IPEX_INFO_LOG("OPS", "", "Add a log for inside ops {}", "abs");
+
+
+
+
+

Event Log

+

    The event log is used for recording a whole event, such as an operator calculation. The whole event is identified by a unique event_id. You can also mark each step by using a step_id. Use IPEX_XXX_EVENT_END() to complete the logging of the whole event, where XXX represents the log level mentioned above; it will be used as the log level for all logs within one single log event.
    

+

Below is an example for using event log:

+
IPEX_EVENT_LOG("OPS", "", "record_avg_pool", "start", "Here record the time start with arg:{}", arg);
+prepare_data();
+IPEX_EVENT_LOG("OPS", "", "record_avg_pool", "data_prepare_finish", "Here record the data_prepare_finish with arg:{}", arg);
+avg_pool();
+IPEX_INFO_EVENT_END("OPS", "", "record_avg_pool", "finish conv", "Here record the end");
+
+
+
+
+
+

    Environment settings
    

+

    Intel® Extension for PyTorch* provides five environment variables for configuring log output:
    

+
    +
      • IPEX_LOG_LEVEL, accepts an integer or string; the default is -1 for DISABLED.

      • IPEX_LOG_COMPONENT, accepts a string, used for specifying the log component and sub log component you would like to log; the default is "ALL". The log component and sub log component are separated by /. You can also specify several log components, such as "OPS;MEMORY".

      • IPEX_LOG_OUTPUT, accepts a string. If you use IPEX_LOG_OUTPUT, all the logs will be recorded in a file rather than on the console. Example: export IPEX_LOG_OUTPUT="./ipex.log".

      • IPEX_LOG_ROTATE_SIZE, accepts an integer; the default is 10. Can be used only with IPEX_LOG_OUTPUT, for specifying how large a file (in MB) is used when rotating the log.

      • IPEX_LOG_SPLIT_SIZE, accepts an integer; the default is null. Can be used only with IPEX_LOG_OUTPUT, for specifying how large a file (in MB) is used when splitting the logs.
    
+
+
+

Usage in python

+
    +
  • torch.xpu.set_log_level(log_level) and torch.xpu.get_log_level(), these two functions are used for getting and setting the log level.

  • +
  • torch.xpu.set_log_output_file_path(log_path) and torch.xpu.get_log_output_file_path(), these two functions are used for getting and setting the log output file path, once log output file path is set, logs will be recorded in file only.

  • +
  • torch.xpu.set_log_rotate_file_size(file size) and torch.xpu.get_log_rotate_file_size(), these two functions are used for getting and setting the log rotate file size. Can be used when output file path is set.

  • +
  • torch.xpu.set_log_split_file_size(file size) and torch.xpu.get_log_split_file_size(), these two functions are used for getting and setting the log split file size. Can be used when output file path is set.

  • +
      • torch.xpu.set_log_component(log_component) and torch.xpu.get_log_component(), these two functions are used for getting and setting the log component. The log component string is the same as defined in the environment settings.
    

  • +
+
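    Taken together, these calls can be combined as in the short sketch below; the chosen level, component string, file path, and rotate size are arbitrary example values.

    import torch
    import intel_extension_for_pytorch

    # Sketch: configure IPEX_LOG from Python; all values are examples only.
    torch.xpu.set_log_level(2)                        # INFO
    torch.xpu.set_log_component("OPS;MEMORY")         # log the OPS and MEMORY components
    torch.xpu.set_log_output_file_path("./ipex.log")  # record logs in a file only
    torch.xpu.set_log_rotate_file_size(10)            # rotate the log file at 10 MB

    print(torch.xpu.get_log_level(), torch.xpu.get_log_component())
    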
+
+

Replace IPEX_SIMPLE_TRACE

+

    Use torch.xpu.set_log_level(0) to get logs equivalent to the previous IPEX_SIMPLE_TRACE usage.
    

+
+
+

Replace IPEX_VERBOSE

+

    Use torch.xpu.set_log_level(1) to get logs equivalent to the previous IPEX_VERBOSE usage.
    

+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/nhwc.html b/xpu/2.3.110+xpu/tutorials/features/nhwc.html new file mode 100644 index 000000000..da34d1739 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/nhwc.html @@ -0,0 +1,443 @@ + + + + + + + Channels Last — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Channels Last

+
+

What is Channels Last

+

    Note: In PyTorch, memory format refers to the data representation that describes how a multidimensional array (nD) is stored in linear (1D) memory address space. Memory format has the same semantic meaning as layout in oneDNN. Layout in PyTorch has a different semantic, describing whether a tensor is dense or sparse with the attributes ‘torch.strided’ and ‘torch.sparse_coo’.
    

+

On CNN models, the canonical order of tensor dimensions is assigned with semantic meaning. For example the input tensor of 2D convolution is of NCHW by default on PyTorch - <batch_size, channels, height, width>. NHWC is an alternative way of describing the tensor dimensions - <batch_size, height, width, channels>.

+

    Look at the following image illustrating NCHW and NHWC when N=1. When N=1, NHWC has the same format as a BMP file image.
    fig-1-memory-layout
    

+

PyTorch refers to NCHW as torch.contiguous_format (the default memory format) and to NHWC as torch.channels_last, which is a new feature as of the 1.5 release.

+

TensorFlow uses NHWC as the default memory format because NHWC has a performance advantage over NCHW. On Intel® platforms, we propose to optimize Channels Last memory path for the following reasons:

+
    +
      • Performance - NHWC performance is not as good as blocked memory format (nChw16c), but it is close, and it performs much better than NCHW.
    

  • +
      • User Experience - Operator coverage of NHWC would be higher than blocked memory format, so user experience is better. To be specific, it is difficult to enable operators that manipulate dims on blocked format, such as sum(dim=?). You would need to convert the tensor from blocked memory format back to NHWC using to_dense() before feeding it into sum(). This is naturally supported on the Channels Last memory format already.
    

  • +
  • Upstream - Will be easier since CPU doesn’t hold secret ingredient and both inference and training will be covered.

  • +
+
+
+

Memory Format Is All That Matters

+

    On CNN models, memory format is almost the foundation of any upper-level design. One important fact is that converting memory format can be very expensive. Thus, in case multiple CNN operators are performed in sequence, e.g. Conv2d -> ReLU -> Conv2d, it's beneficial to convert the memory format once, do the computation, and reorder it back at the end.
    

+

On PyTorch, you can use 3 types of memory formats on CNN models:

+
+

a. NCHW (default)

+
    import torch
    device='cpu' # or 'xpu'
    
+if device == 'xpu':
+  import intel_extension_for_pytorch
+
+## NB: internally blocked format will still be used.
+##   aka. we do 'reorder' for 'input', 'weight' and 'output',
+##   and believe me this is expensive, roughly 50% perf loss...
+input = torch.randn(1, 10, 32, 32).to(device)
+model = torch.nn.Conv2d(10, 20, 1, 1).to(device)
+output = model(input)
+
+
+
+
+

b. NHWC

+
    import torch
    device='cpu' # or 'xpu'
    
+if device == 'xpu':
+  import intel_extension_for_pytorch
+
+input = torch.randn(1, 10, 32, 32).to(device)
+model = torch.nn.Conv2d(10, 20, 1, 1).to(device)
+## NB: convert to Channels Last memory format.
+##   oneDNN supports NHWC for feature maps (input, output),
+##   but weight still needs to be of blocked format.
+##   Still we can save reorders for feature maps.
+input = input.to(memory_format=torch.channels_last)
+model = model.to(memory_format=torch.channels_last)
+output = model(input)
+
+
+
+
+

c. Blocked (nChw16c, on CPU)

+
    import torch
    from torch.utils import mkldnn as mkldnn_utils
    
+input = torch.randn(1, 10, 32, 32)
+model = torch.nn.Conv2d(10, 20, 1, 1)
+## NB: convert to blocked memory format.
+##   Note that 'output' is in blocked memory format,
+##   in case the subsequent operator doesn't support blocked memory format
+##   you need to manually reorder it back to NCHW by output.to_dense()
+##   mkldnn_utils.to_mkldnn(model) is used to prepack the weight, this will save weight reorder time
+##   for inference. For training, it is not needed.
+input = input.to_mkldnn()
+model = mkldnn_utils.to_mkldnn(model)
+output = model(input)
+
+
+

    It is better to explain the concepts here with a diagram; the dotted lines indicate a simple memory view, no hard copy.
    fig-2(1)-pt-conv-layout-path-dispatch
    

+

    The conclusion is that the NHWC path saves the feature map reorders compared with the NCHW path, but a weight reorder is still necessary since oneDNN requires weights to be in blocked memory format. From a performance perspective, when batch_size=N, the weight reorder is minimal compared to the feature map reorder; but when batch_size=1, the weight reorder is usually not negligible. So whether to enable weight prepacking on the Channels Last memory format needs further discussion.
    

+
+
+
+

PyTorch Strided Layout

+

Before moving on, let’s explain how PyTorch organizes tensors in memory - the layout. Here we only focus on dense tensors, skipping ‘coo’ layout of sparse tensor.

+

The question itself can be reinterpreted as, for a tensor of size <N, C, H, W>, how does PyTorch access the element with index <n, c, h, w> from memory? The answer is stride:

+
tensor: <N, C, H, W>
+index: <n, c, h, w>
+strides: <CHW, HW, W, 1>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = CHW * n + HW * c + W * h + 1 * w
+
+
+

One merit of introducing stride is that it can express noncontiguous tensors, e.g. a slice of big tensor. For example, the ‘Xs’ in the following image have a stride of <n1+n2, 1>.

+

fig-3-pytorch-strided-layout

+

Keep in mind that PyTorch Tensor does not have an attribute called ‘memory_format’ or something else. The memory format expression completely relies on size and stride. The design principle can be found at reference: RFC: Memory format (aka layout aka NHWC) support. No matter what the tensor’s memory format is, we need a logical canonical order for the dimensions - that is NCHW on PyTorch. Thus, size and stride are ALWAYS described in the order of NCHW. Let’s now look at the Channels Last case of the previous question:

+
tensor: <N, C, H, W>
+index: <n, c, h, w>
+strides: <HWC, 1, WC, C>
+offset(n,c,h,w) = stride_n * n + stride_c * c + stride_h * h + stride_w * w
+                = HWC * n + 1 * c + WC * h + C * w
+
+
+

Actually, this pattern applies to ALL other memory formats as long as it is 4-dim, e.g. strides for CHWN would be <1, HWN, WN, N>.

+
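    You can verify these stride patterns directly. The short check below (with an arbitrarily chosen shape) prints <CHW, HW, W, 1> for the contiguous tensor and <HWC, 1, WC, C> for the Channels Last one.

    import torch

    # Arbitrary 4-dim shape (N, C, H, W) = (2, 3, 4, 5).
    x = torch.empty(2, 3, 4, 5)
    print(x.stride())   # (60, 20, 5, 1)  -> <CHW, HW, W, 1>

    y = x.contiguous(memory_format=torch.channels_last)
    print(y.stride())   # (60, 1, 15, 3)  -> <HWC, 1, WC, C>
    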
+
+

Channels Last Memory Format APIs

+
+

a. tensor creation

+
device='cpu' # or 'xpu'
+if device == 'xpu':
+  import intel_extension_for_pytorch
+
+x = torch.empty(N, C, H, W, memory_format=torch.channels_last).to(device)
+
+
+
+
+

b. tensor conversion

+
device='cpu' # or 'xpu'
+if device == 'xpu':
+  import intel_extension_for_pytorch
+
+## .contiguous() transforms NHWC noncontiguous to NHWC contiguous.
+## .to() converts NCHW tensor to NHWC one, it is outplace.
+x = x.to(device)
+x = x.contiguous(memory_format=torch.channels_last)
+x = x.to(memory_format=torch.channels_last)
+
+## contiguous check
+x.is_contiguous(memory_format=torch.channels_last)
+
+
+
+
+

c. model conversion

+
device='cpu' # or 'xpu'
+if device == 'xpu':
+  import intel_extension_for_pytorch
+
+## NB: tensor.to() is an outplace operation
+##   model.to() is inplace. It calls _apply() which is inplace.
+model = model.to(device).to(memory_format=torch.channels_last)
+input = input.to(device).to(memory_format=torch.channels_last)
+
+
+
+
+

d. operator coverage in PyTorch

+

Detailed operator coverage information has been listed at reference Operators-with-Channels-Last-support.

+

Some spontaneous questions:

+
    +
      • How to tell whether this model or operator supports Channels Last? - This requires a manual memory format check (see the check sketched after this list): ‘torch.channels_last’ input and weight shall NOT generate ‘torch.contiguous_format’ output.
    

  • +
      • What if the model comprises operators that do not support Channels Last? - No error messages will be shown; the NHWC tensor will be handled by the operator as a non-contiguous NCHW tensor, so the result might not be correct depending on the algorithm of this operator.
    

  • +
+
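    A minimal version of that manual check could look like the sketch below; the layer and input shape are arbitrary choices for illustration.

    import torch

    # Manual memory format check: channels_last input and weight should not
    # produce contiguous_format-only output. Layer and shape are arbitrary.
    conv = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
    x = torch.randn(1, 8, 32, 32).to(memory_format=torch.channels_last)
    out = conv(x)
    print(out.is_contiguous(memory_format=torch.channels_last))  # True if Channels Last is supported
    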
+
+
+

Writing Channels Last Kernels on CPU

+
+

a. Register Channels Last Kernel in ATen Native Manner

+

    The general guideline has been listed under the reference Writing-memory-format-aware-operators, so it is not repeated here. You may take the recent PR optimize upsample performance linear mode on CPU as an example, which also demonstrates the NHWC performance advantage over NCHW because of the ease of vectorization.
    

+
+
+

b. Register oneDNN Kernel on Channels Last

+

    Registering a oneDNN kernel under the Channels Last memory format on CPU is no different from cuDNN: only very few upper-level changes are needed, such as changing ‘contiguous()’ to ‘contiguous(suggested_memory_format)’. The automatic reorder of oneDNN weights is hidden in ideep.
    

+
+
+
+

oneDNN NHWC APIs

+

Compared to NCHW interfaces, 2 parts need to be addressed on NHWC interfaces:

+
+

a. Create NHWC Memory

+

The logical size and stride description of oneDNN is always in NCHW, this is identical to PyTorch. Example code such as

+
/* create md from memory::format_tag */
+auto src_md = memory::desc(
+        {N, C, H, W}, // logical dims, the order is defined by a primitive
+        memory::data_type::f32, // tensor's data type
+        memory::format_tag::nhwc // memory format, NHWC in this case
+);
+
+/* alternative: create md from strides */
+auto src_md = memory::desc(
+        {N, C, H, W}, // logical dims, the order is defined by a primitive
+        memory::data_type::f32, // tensor's data type
+        {stride_N, stride_C, stride_H, stride_W} // the strides
+);
+
+/* create memory */
+auto src_mem = memory(src_md, src_data_ptr, engine);
+
+
+
+
+

b. Create Convolution Primitive

+
    +
      • NCHW - create memory::desc with any for ‘input’, ‘output’ and ‘weight’; query the proposed memory::desc from the convolution primitive;
    

  • +
      • NHWC - create memory::desc with format_tag::nhwc for ‘input’ and ‘output’, and use any for ‘weight’; if we use hwio for ‘weight’, the convolution primitive will be created with gemm rather than jit avx512.
    

  • +
+
+
+
+

Channels Last 1D support on XPU

+

    Both stock PyTorch and Intel® Extension for PyTorch* support Channels Last (2D) and Channels Last 3D; however, they differ regarding Channels Last 1D. Stock PyTorch doesn't support Channels Last 1D, while XPU supplies limited support for it. We only support the Channels Last 1D memory format in these operators: Conv1D, BatchNorm1D, MaxPool1D, Concat, binary add, binary div, upsample linear and upsample nearest.
    

+

    The usage of Channels Last 1D on XPU is different from stock PyTorch Channels Last (2D) or Channels Last 3D. We use torch.xpu.to_channels_last_1d() to do the conversion for both the input tensor and the model. See below:
    

+
import torch
+import intel_extension_for_pytorch
+
+sycl_device = torch.device("xpu")
+
+
+class Model(torch.nn.Module):
+    def __init__(self):
+        super(Model, self).__init__()
+        self.block = torch.nn.Sequential(
+            torch.nn.Conv1d(3, 3, kernel_size=3, stride=1, padding=1, bias=False),
+            torch.nn.BatchNorm1d(3)
+        )
+
+    def forward(self, x):
+        x = self.block(x)
+        return x
+
+
+model = Model()
+test_input = torch.rand([2, 3, 4])
+test_input_xpu = test_input.to(sycl_device)
    +test_input_xpu = torch.xpu.to_channels_last_1d(test_input_xpu) # Channels Last 1D conversion for tensor
    +model = model.to(sycl_device)
    +model = torch.xpu.to_channels_last_1d(model) # Channels Last 1D conversion for model
    
+xpu_res = model(test_input_xpu)
+
+print(torch.xpu.is_contiguous_channels_last_1d(xpu_res))
+
+
+
+

a. tensor conversion with Channels Last 1D

+
input_xpu = torch.xpu.to_channels_last_1d(input_xpu)
+
+
+
+
+

b. model conversion with Channels Last 1D

+
model = torch.xpu.to_channels_last_1d(model)
+
+
+
+
+

c. determine if in Channels Last 1D memory format

+
print(torch.xpu.is_contiguous_channels_last_1d(input))
+
+
+

    Note that because stock PyTorch doesn't support the Channels Last 1D feature yet (see RFC: A suggestion of channels last memory format implementation for 3D tensor), except for the Channels Last 1D APIs above, other APIs from stock PyTorch may be invalid. For example, if you want to use the corresponding memory format APIs for Channels Last 1D, they may not work as you wish.
    

+
+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/profiler_kineto.html b/xpu/2.3.110+xpu/tutorials/features/profiler_kineto.html new file mode 100644 index 000000000..3d621519c --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/profiler_kineto.html @@ -0,0 +1,321 @@ + + + + + + + Kineto Supported Profiler Tool (Prototype) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Kineto Supported Profiler Tool (Prototype)

+
+

Introduction

+

    The Kineto-supported profiler tool is an extension of the PyTorch* profiler for profiling operators' execution time on GPU devices. With this tool, you can get information in many fields about the models or code scripts you run. Intel® Extension for PyTorch* is built with Kineto support by default; enable this tool using the with statement before the code segment.
    

+
+
+

Use Case

+

To use the Kineto supported profiler tool, you need to build Intel® Extension for PyTorch* from source or install it via prebuilt wheel. You also have various methods to disable this tool.

+
+

Build Tool

+

    The build flag USE_PTI is ON by default for Intel® Extension for PyTorch* to enable the PTI-based Kineto profiler. Before building, make sure the PTI-SDK is preinstalled and sourced in your environment. Here is the command you can use to download the PTI-SDK onto your machine: wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/5987ec30-be32-4dee-870f-7b97a1113488/l_intel-pti-dev_p_0.9.0.33_offline.sh. After downloading the file, install it by running sh l_intel-pti-dev_p_0.9.0.33_offline.sh. After you install PTI, run source <path-of-installation>/pti/latest/env/vars.sh to source it; then you can start your build.
    

+
+
+

Use Tool

+
+

Add Profiler Into Script

+

    All the usages are aligned with what official PyTorch* suggests. Please refer to the PyTorch* tutorial page for the first steps.
    

+

In your model script, write with statement to enable the Kineto supported profiler tool ahead of your code snippets, as shown in the following example:

+
# import all necessary libraries
+import torch
+from torch.profiler import profile, ProfilerActivity
+import intel_extension_for_pytorch
+
+# these lines won't be profiled before enabling profiler tool
+input_tensor = torch.randn(1024, dtype=torch.float32, device='xpu:0')
+
+# enable Kineto supported profiler tool with a `with` statement
+with profile(activities=[ProfilerActivity.CPU,
+                         ProfilerActivity.XPU]) as prof:
+    # do what you want to profile here after the `with` statement with proper indent
+    output_tensor_1 = torch.nonzero(input_tensor)
+    output_tensor_2 = torch.unique(input_tensor)
+
+# print the result table formatted by the profiler tool as your wish
+print(prof.key_averages().table())
+
+
+

In your model script, you can also assign a schedule for profile loops of iterations, as shown in the following example:

+
from torch.profiler import schedule
+
+# assign a customized schedule
+my_schedule = schedule(
+    skip_first=10,
+    wait=1,
+    warmup=3,
+    active=1,
+    repeat=2)
+
    +# also define a handler for outputting results
    +def trace_handler(p):
    +    print(p.key_averages().table(sort_by="self_xpu_time_total", row_limit=10))
    +    p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")
    
+
+# pass customized schedule and trace handler to profiler outside the for-loop
+with profile(activities=[ProfilerActivity.CPU,
+                         ProfilerActivity.XPU],
+             schedule=my_schedule,
+             on_trace_ready=trace_handler) as prof:
+    for iter in range(len(dataloader)):
+        model(input)
+        # don't forget a step() at the end of each loop
+        prof.step()
+
+
+

There are a number of useful parameters defined in torch.profiler.profile. Many of them are aligned with usages defined in PyTorch*’s official profiler, such as record_shapes, a very useful parameter to control whether to record the shape of input tensors for each operator. To enable Kineto supported profiler on XPU backend, remember to add torch.profiler.ProfilerActivity.XPU into the list of activities. For the usage of more parameters, please refer to PyTorch*’s API reference.

+
+
+

Disable Tool in Model Script

+

    To disable this profiler tool in your model script, you must remove the profiler-related code, as PyTorch* doesn't offer a switch in the torch.profiler.profile API. To reduce the effort of switching the profiler on and off, it is suggested to use contextlib for control, as shown below:
    

+
import contextlib
+
+def profiler_setup(profiling=False, *args, **kwargs):
+    if profiling:
+        return torch.profiler.profile(*args, **kwargs)
+    else:
+        return contextlib.nullcontext()
+
+# you can pass official arguments as normal
    +with profiler_setup(profiling=should_profile,
    +                    activities=[ProfilerActivity.XPU],
    +                    schedule=my_schedule,
    +                    on_trace_ready=trace_handler) as prof:
    +    for iter in range(len(dataloader)):
    
+        model(input)
+
+        if should_profile:
+            prof.step()
+
+
+
+
+

Profile on Multi-device Application

+

    Follow the typical usage for profiling a multi-device application, and explicitly call torch.xpu.synchronize(device_id) for all involved devices. For example:
    

+
    +# To run this example, please make sure you have more than one device.
    +assert torch.xpu.device_count() > 1, "This example needs more than one device."
    
+
+# put first input on device "xpu:0"
+a_0 = torch.randn(100).to(torch.device("xpu:0"))
+# put second input on device "xpu:1"
+a_1 = torch.randn(100).to(torch.device("xpu:1"))
+
+# Start profiler as normal
+with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.XPU]) as prof:
+    # run kernel on "xpu:0"
+    b_0 = a_0 + a_0
+    # run kernel on "xpu:1"
+    b_1 = a_1 + a_1
+    # explicitly synchronize all involved devices
+    torch.xpu.synchronize(torch.device("xpu:0"))
+    torch.xpu.synchronize(torch.device("xpu:1"))
+
    +# You may check kernels on different devices from the chrome trace
    
+prof.export_chrome_trace("trace_example_on_multi_device.json")
+
+
+
+
+
+

Result

+

Using the first script shown above in Use Tool part, you’ll see the result table printed out to the console as below:

+

Kiento_profiler_result_console

+

    In this result, you can find several fields, including the following (a sorting example is shown after this list):
    

+
    +
      • Name: the name of the run operators, runtime functions, or kernels.

      • Self CPU %, Self CPU: the time consumed by the operator itself on the host, excluding its child operator calls. The column marked with a percentage sign shows the proportion of time relative to the total self CPU time. When an operator is called more than once in a run, the self CPU time accumulates in this field.

      • CPU total %, CPU total: the time consumed by the operator on the host, including its child operator calls. The column marked with a percentage sign shows the proportion of time relative to the total CPU time. When an operator is called more than once in a run, the CPU time accumulates in this field.

      • CPU time avg: the average time consumed by each call of the operator on the host. This average is calculated from the CPU total time.

      • Self XPU, Self XPU %: similar to Self CPU (%) but shows the time consumption on XPU devices.

      • XPU total: similar to CPU total but shows the time consumption on XPU devices.

      • XPU time avg: similar to CPU time avg but shows the average time consumption on XPU devices. This average is calculated from the XPU total time.

      • # of Calls: the number of calls for each operator in a run.
    
+
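    For example, to focus on the operators that spend the most time on the XPU device, you can sort and truncate the summary table; the column name and row limit below are just one reasonable choice.

    # Sort the summary by self XPU time and keep only the top 10 rows.
    print(prof.key_averages().table(sort_by="self_xpu_time_total", row_limit=10))
    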
+
+

Export to Chrome Trace

+

You can export the result to a json file and then load it in the Chrome trace viewer (chrome://tracing) or Perfetto viewer (ui.perfetto.dev) by adding this line in your model script:

+
prof.export_chrome_trace("trace_file.json")
+
+
+

    You can examine the sequence of profiled operators, runtime functions and XPU kernels in these trace viewers. Below is a trace result for a ResNet50 run on the XPU backend, viewed in the Perfetto viewer:
    

+

profiler_kineto_result_perfetto_viewer

+
+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/profiler_legacy.html b/xpu/2.3.110+xpu/tutorials/features/profiler_legacy.html new file mode 100644 index 000000000..0b157bb24 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/profiler_legacy.html @@ -0,0 +1,143 @@ + + + + + + + Legacy Profiler Tool (Deprecated) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + +
    

Legacy Profiler Tool (Deprecated)

+
+

Introduction

+

The legacy profiler tool will be deprecated from Intel® Extension for PyTorch* very soon. Please use Kineto Supported Profiler Tool instead for profiling operators’ executing time cost on Intel® GPU devices.

+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/simple_trace.html b/xpu/2.3.110+xpu/tutorials/features/simple_trace.html new file mode 100644 index 000000000..aecc888a0 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/simple_trace.html @@ -0,0 +1,235 @@ + + + + + + + Simple Trace Tool (Deprecated) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + +
    

Simple Trace Tool (Deprecated)

+
+

Introduction

+

Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. You can enable this tool and have it automatically print out verbose messages of called operators in a stack format with indenting to distinguish the context. You can enable and disable this tool using a simple method.

+
+
+

Use Case

+

To use the simple trace tool, you need to build Intel® Extension for PyTorch* from source and add explicit calls to enable and disable tracing in your model script. When enabled, the trace messages will be printed to the console screen by default, along with verbose log messages.

+
+

Enable and Disable Tool

+

    The IPEX_SIMPLE_TRACE environment variable can be used to turn simple trace on or off. It is set to 0 by default. Set it to 1 to enable simple trace for all operators:
    

+
export IPEX_SIMPLE_TRACE=1 
+
+
+
+
+

Use Simple Trace in Model

+

    In your model script, bracket the code you want to trace with calls to torch.xpu.enable_simple_trace() and torch.xpu.disable_simple_trace(), as shown in the following example:
    

+
# import all necessary libraries
+import torch
+import intel_extension_for_pytorch
+
+print(torch.xpu.using_simple_trace())   # False
+a = torch.randn(100).xpu()              # this line won't be traced
+
+torch.xpu.enable_simple_trace()         # to enable simple trace tool
+
+# test code (with tracing enabled) begins here
+b = torch.randn(100).xpu()
+c = torch.unique(b)
+# test code ends here
+
+torch.xpu.disable_simple_trace()        # to disable simple trace tool
+
+
+

The simple trace output will start after being enabled, and will continue until +the call to disable it, so be careful with your model script logic so the disable call is +not unintentionally skipped.

+
+
+

Results

+

    Using the script shown above as the example, you'll see these messages printed out to the console:
    

+
[262618.262618]  Call  into  OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#0)
+[262618.262618]  Step out of OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#0)
+[262618.262618]  Call  into  OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#1)
+[262618.262618]  Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#1)
+[262618.262618]  Call  into  OP: wrapper___unique2 -> at::AtenIpexTypeXPU::_unique2 (#2)
+[262618.262618]    Call  into  OP: wrapper__clone -> at::AtenIpexTypeXPU::clone (#3)
+[262618.262618]      Call  into  OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#4)
+[262618.262618]      Step out of OP: wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided (#4)
+[262618.262618]      Call  into  OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#5)
+[262618.262618]      Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#5)
+[262618.262618]    Step out of OP: wrapper__clone -> at::AtenIpexTypeXPU::clone (#3)
+[262618.262618]    Call  into  OP: wrapper___reshape_alias -> at::AtenIpexTypeXPU::_reshape_alias (#6)
+[262618.262618]    Step out of OP: wrapper___reshape_alias -> at::AtenIpexTypeXPU::_reshape_alias (#6)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#7)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#7)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#8)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#8)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#9)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#9)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#10)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#10)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#11)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#11)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#12)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#12)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#13)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#13)
+[262618.262618]    Call  into  OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#14)
+[262618.262618]    Step out of OP: wrapper_memory_format_empty -> at::AtenIpexTypeXPU::empty (#14)
+[262618.262618]    Call  into  OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#15)
+[262618.262618]    Step out of OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#15)
+[262618.262618]    Call  into  OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#16)
+[262618.262618]    Step out of OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#16)
+[262618.262618]    Call  into  OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#17)
+[262618.262618]    Step out of OP: wrapper__as_strided -> at::AtenIpexTypeXPU::as_strided (#17)
+[262618.262618]    Call  into  OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#18)
+[262618.262618]    Step out of OP: wrapper___local_scalar_dense -> at::AtenIpexTypeXPU::_local_scalar_dense (#18)
+[262618.262618]    Call  into  OP: wrapper__resize_ -> at::AtenIpexTypeXPU::resize_ (#19)
+[262618.262618]    Step out of OP: wrapper__resize_ -> at::AtenIpexTypeXPU::resize_ (#19)
+[262618.262618]  Step out of OP: wrapper___unique2 -> at::AtenIpexTypeXPU::_unique2 (#2)
+[262618.262618]  Call  into  OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#20)
+[262618.262618]  Step out of OP: wrapper__copy_ -> at::AtenIpexTypeXPU::copy_ (#20)
+
+
+

The meanings of each field are shown as below:

+
    +
      • pid.tid, [262618.262618]: the process id and the thread id responsible for the printed-out line.

      • behavior, Call into OP, Step out of OP: the call-in or step-out behavior of the operators in a run.

      • name1 -> name2, wrapper__empty_strided -> at::AtenIpexTypeXPU::empty_strided: the calling operator for the current step. The name1 before the arrow shows the wrapper from PyTorch. The name2 after the arrow shows the function that was called into or stepped out of in Intel® Extension for PyTorch* at the current step.

      • (#No.), (#0): index of the called operators. This index is numbered from 0 in the order in which the operators are called.

      • indent: the indent ahead of every behavior shows the nested relationship between operators. The operator call-in line with more indent should be a child of what was called above it.
    
+

With this output, you can see the calling stack of the traced script without using complicated debug tools such as gdb.

+
+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/features/torch_compile_gpu.html b/xpu/2.3.110+xpu/tutorials/features/torch_compile_gpu.html new file mode 100644 index 000000000..e4f4abe27 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/features/torch_compile_gpu.html @@ -0,0 +1,239 @@ + + + + + + + torch.compile for GPU (Beta) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

torch.compile for GPU (Beta)

+
+
+

Introduction

+

Intel® Extension for PyTorch* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPU via the flagship torch.compile API through the default “inductor” backend (TorchInductor). The Triton compiler has been the core of the Inductor codegen supporting various accelerator devices. Intel has extended TorchInductor by adding Intel GPU support to Triton. Additionally, post-op fusions for convolution and matrix multiplication, facilitated by oneDNN fusion kernels, contribute to enhanced efficiency for computational intensive operations. Leveraging these features is as simple as using the default “inductor” backend, making it easier than ever to unlock the full potential of your PyTorch models on Intel GPU platforms.

+
+
+

Required Dependencies

+

Verified version:

+
    +
  • torch : v2.3

  • +
  • intel_extension_for_pytorch : v2.3

  • +
  • triton : >= v3.0.0

  • +
+

Install Intel® oneAPI Base Toolkit 2024.2.1.

+

    Follow Intel® Extension for PyTorch* Installation to install torch and intel_extension_for_pytorch first.
    

+

Triton could be directly installed using the following command:

+
pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu
+
+
+

    Remember to activate the oneAPI Base Toolkit with the following commands.
    

+
# {dpcpproot} is the location for dpcpp ROOT path and it is where you installed oneAPI DPCPP, usually it is /opt/intel/oneapi/compiler/latest or ~/intel/oneapi/compiler/latest
+source {dpcpproot}/env/vars.sh
+
+
+
+
+

Example Usage

+
+

    Inference with torch.compile
    

+
import torch
+import intel_extension_for_pytorch
+
+# create model
+model = SimpleNet().to("xpu")
+
+# compile model
+compiled_model = torch.compile(model, options={"freezing": True})
+
+# inference main
+input = torch.rand(64, 3, 224, 224, device=torch.device("xpu"))
+with torch.no_grad():
+    with torch.xpu.amp.autocast(dtype=torch.float16):
+        output = compiled_model(input)
+
+
+
+
+

Training with torch.compile

+
import torch
+import intel_extension_for_pytorch
+
+# create model and optimizer
+model = SimpleNet().to("xpu")
+optimizer = torch.optim.SGD(model.parameters(), lr=..., momentum=..., weight_decay=...)
+
+# compile model
+compiled_model = torch.compile(model)
+
+# training main
+input = torch.rand(64, 3, 224, 224, device=torch.device("xpu"))
+with torch.xpu.amp.autocast(dtype=torch.bfloat16):
+    output = compiled_model(input)
+    loss = loss_function(output)
+optimizer.zero_grad()
+loss.backward()
+optimizer.step()
+
+
+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/getting_started.html b/xpu/2.3.110+xpu/tutorials/getting_started.html new file mode 100644 index 000000000..54a39cf66 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/getting_started.html @@ -0,0 +1,199 @@ + + + + + + + Quick Start — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Quick Start

+

The following instructions assume you have installed the Intel® Extension for PyTorch*. For installation instructions, refer to Installation.

+

To start using the Intel® Extension for PyTorch* in your code, you need to make the following changes:

+
    +
  1. Import the extension with import intel_extension_for_pytorch as ipex.

      2. Move model and data to GPU with to('xpu'), if you want to run on GPU.
    

      3. Invoke the optimize() function to apply optimizations.
    

      4. For TorchScript, invoke torch.jit.trace() and torch.jit.freeze().
    

    
+

Important: It is highly recommended to import intel_extension_for_pytorch right after import torch, prior to importing other packages.

+

    The example below demonstrates how to use the Intel® Extension for PyTorch*:
    

+
import torch
+import intel_extension_for_pytorch as ipex
+
+model = Model()
+model.eval() # Set the model to evaluation mode for inference, as required by ipex.optimize() function.
+data = ...
+dtype=torch.float32 # torch.bfloat16, torch.float16 (float16 only works on GPU)
+
+##### Run on GPU ######
+model = model.to('xpu')
+data = data.to('xpu')
+#######################
+
+model = ipex.optimize(model, dtype=dtype)
+
+########## FP32 ############
+with torch.no_grad():
+####### BF16 on CPU ########
+with torch.no_grad(), torch.cpu.amp.autocast():
+##### BF16/FP16 on GPU #####
+with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=dtype, cache_enabled=False):
+############################
+  ###### Torchscript #######
+  model = torch.jit.trace(model, data)
+  model = torch.jit.freeze(model)
+  ###### Torchscript #######
+
+  model(data)
+
+
+
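    The block above stacks the FP32, BF16, and FP16 alternatives; pick exactly one of the with statements when you run it. As a minimal sketch, a GPU FP16 inference variant (Model() and data remain placeholders) could look like this:

    import torch
    import intel_extension_for_pytorch as ipex

    # Minimal GPU FP16 inference sketch; Model() and data are placeholders.
    model = Model().eval().to('xpu')
    data = data.to('xpu')
    model = ipex.optimize(model, dtype=torch.float16)

    with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16, cache_enabled=False):
        model = torch.jit.trace(model, data)
        model = torch.jit.freeze(model)
        model(data)
    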

    More examples, including training and the usage of low-precision data types, are available at Examples.
    

+
+

Execution

+

    There are some runtime environment variables that can be used to configure execution on GPU. Please check Advanced Configuration for more detailed information.
    

+

    Set OCL_ICD_VENDORS to the default path /etc/OpenCL/vendors. Set CCL_ROOT if you are using multiple GPUs.
    

+
export OCL_ICD_VENDORS=/etc/OpenCL/vendors
+export CCL_ROOT=${CONDA_PREFIX} 
+python <script>
+
+
+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/introduction.html b/xpu/2.3.110+xpu/tutorials/introduction.html new file mode 100644 index 000000000..965ecdd00 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/introduction.html @@ -0,0 +1,164 @@ + + + + + + + Introduction — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Introduction

+

    Intel® Extension for PyTorch* extends PyTorch* with the latest performance optimizations for Intel hardware. Optimizations take advantage of Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. The extension provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.
    

+

For the detailed list of supported features and usage instructions, refer to Features.

+
+

Get Started

+ +
+
+

API Documentation

+

For detailed description of the Intel® Extension for PyTorch* APIs, refer to the API Documentation section.

+
+
    
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/known_issues.html b/xpu/2.3.110+xpu/tutorials/known_issues.html new file mode 100644 index 000000000..e36910670 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/known_issues.html @@ -0,0 +1,311 @@ + + + + + + + Troubleshooting — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
    

Troubleshooting

+
+

General Usage

+
    +
  • Problem: FP64 data type is unsupported on current platform.

    + +
  • +
  • Problem: Runtime error invalid device pointer if import horovod.torch as hvd before import intel_extension_for_pytorch.

    +
      +
    • Cause: Intel® Optimization for Horovod* uses utilities provided by Intel® Extension for PyTorch*. The improper import order causes Intel® Extension for PyTorch* to be unloaded before Intel® +Optimization for Horovod* at the end of the execution and triggers this error.

    • +
        • Solution: Do import intel_extension_for_pytorch before import horovod.torch as hvd (see the import-order sketch after this list).
    

    • +
    +
  • +
  • Problem: Number of dpcpp devices should be greater than zero.

    +
      +
    • Cause: If you use Intel® Extension for PyTorch* in a conda environment, you might encounter this error. Conda also ships the libstdc++.so dynamic library file that may conflict with the one shipped +in the OS.

    • +
    • Solution: Export the libstdc++.so file path in the OS to an environment variable LD_PRELOAD.

    • +
    +
  • +
  • Problem: Symbol undefined caused by _GLIBCXX_USE_CXX11_ABI.

    +
    ImportError: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
    +
    +
    +
      +
    • Cause: DPC++ does not support _GLIBCXX_USE_CXX11_ABI=0, Intel® Extension for PyTorch* is always compiled with _GLIBCXX_USE_CXX11_ABI=1. This symbol undefined issue appears when PyTorch* is +compiled with _GLIBCXX_USE_CXX11_ABI=0.

    • +
        • Solution: Pass export GLIBCXX_USE_CXX11_ABI=1 and compile PyTorch* with a particular compiler which supports _GLIBCXX_USE_CXX11_ABI=1. We recommend using prebuilt wheels from the download server (https://developer.intel.com/ipex-whl-stable-xpu) to avoid this issue.
    

    • +
    +
  • +
  • Problem: -997 runtime error when running some AI models on Intel® Arc™ A-Series GPUs.

    +
      +
    • Cause: Some of the -997 runtime errors are actually out-of-memory errors. Because Intel® Arc™ A-Series GPUs have less device memory than Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series, running some AI models on them may trigger out-of-memory errors, which are most likely reported as a -997 runtime error. This is expected. Memory usage optimization is a work in progress to allow Intel® Arc™ A-Series GPUs to support more AI models.

    • +
    +
  • +
  • Problem: Building from source for Intel® Arc™ A-Series GPUs fails on WSL2 without any error thrown.

    +
      +
    • Cause: Your system probably does not have enough RAM, so the Linux kernel’s out-of-memory (OOM) killer was invoked. You can verify this by running dmesg in bash (the WSL2 terminal).

    • +
    • Solution: If the OOM killer did indeed kill the build process, try increasing the swap size of WSL2 and/or decreasing the number of parallel build jobs with the environment variable MAX_JOBS (by default it equals the number of logical CPU cores, so setting MAX_JOBS to 1 is a very conservative approach that slows the build down considerably).

    • +
    +
  • +
  • Problem: Some workloads terminate with an error CL_DEVICE_NOT_FOUND after some time on WSL2.

    +
      +
    • Cause: This issue is due to the TDR feature on Windows.

    • +
    • Solution: Try increasing TDRDelay in your Windows Registry to a large value, such as 20 (the default is 2 seconds), and reboot.

    • +
    +
  • +
  • Problem: Random bad termination after AI model convergence test (>24 hours) finishes.

    +
      +
    • Cause: This is a random issue that occurs when an AI model convergence test finishes. It is not user-friendly, as the model execution ends ungracefully.

    • +
    • Solution: Kill the process after the convergence test finishes, or use checkpoints to divide the convergence test into several phases and execute them separately.

    • +
    +
  • +
  • Problem: Runtime error munmap_chunk(): invalid pointer when executing some scaling LLM workloads on the Intel® Data Center GPU Max Series platform.

    +
      +
    • Cause: Users targeting GPU use must set the environment variable FI_HMEM=system to disable GPU support in the underlying libfabric, as Intel® MPI Library 2021.13.1 offloads the GPU support instead. This avoids a potential bug in libfabric GPU initialization.

    • +
    • Solution: Set the environment variable FI_HMEM=system to work around this issue when it is encountered.

    • +
    +
  • +
+
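A minimal sketch of the import order for the Horovod problem described above (the hvd.init() and device-selection lines follow typical Horovod usage and are shown only for context):

import torch
import intel_extension_for_pytorch as ipex  # import the extension first ...
import horovod.torch as hvd                 # ... and Horovod afterwards

hvd.init()
torch.xpu.set_device(hvd.local_rank())  # bind this rank to its local XPU device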
+
+

Library Dependencies

+
    +
  • Problem: Cannot find oneMKL library when building Intel® Extension for PyTorch* without oneMKL.

    +
    /usr/bin/ld: cannot find -lmkl_sycl
    +/usr/bin/ld: cannot find -lmkl_intel_ilp64
    +/usr/bin/ld: cannot find -lmkl_core
    +/usr/bin/ld: cannot find -lmkl_tbb_thread
    +dpcpp: error: linker command failed with exit code 1 (use -v to see invocation)
    +
    +
    +
      +
    • Cause: When PyTorch* is built with the oneMKL library and Intel® Extension for PyTorch* is built without it, this linker issue may occur.

    • +
    • Solution: Resolve the issue by setting:

      +
      export USE_ONEMKL=OFF
      +export MKL_DPCPP_ROOT=${HOME}/intel/oneapi/mkl/latest
      +
      +
      +
    • +
    +

    Then do a clean build of Intel® Extension for PyTorch*.

    +
  • +
  • Problem: Undefined symbol: mkl_lapack_dspevd. Intel MKL FATAL ERROR: cannot load libmkl_vml_avx512.so.2 or libmkl_vml_def.so.2.

    +
      +
    • Cause: This issue may occur when Intel® Extension for PyTorch* is built with the oneMKL library and PyTorch* is not built with any MKL library. The oneMKL kernel may incorrectly run into the CPU backend and trigger this issue.

    • +
    • Solution: Resolve the issue by installing the oneMKL library from conda:

      +
      conda install mkl
      +conda install mkl-include
      +
      +
      +
    • +
    +

    Then do a clean build of PyTorch*.

    +
  • +
  • Problem: OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory.

    +
      +
    • Cause: The wrong MKL library is used when multiple MKL libraries exist in the system.

    • +
    • Solution: Preload oneMKL by:

      +
      export LD_PRELOAD=${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_lp64.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_ilp64.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_gnu_thread.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_core.so.2:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_sycl.so.2
      +
      +
      +

      If you continue seeing similar issues for other shared object files, add the corresponding files under ${MKL_DPCPP_ROOT}/lib/intel64/ to LD_PRELOAD. Note that the suffix of the libraries may change (e.g., from .1 to .2) if more than one oneMKL library is installed on the system.

      +
    • +
    +
  • +
  • Problem: RuntimeError: could not create an engine.

    +
      +
    • Cause: The OCL_ICD_VENDORS path is set incorrectly when activating an existing conda environment.

    • +
    • Solution: Run export OCL_ICD_VENDORS=/etc/OpenCL/vendors after conda activate.

    • +
    +
  • +
  • Problem: Issues related to CCL environment variable configuration when running distributed tasks.

    +
      +
    • Cause: The CCL_ROOT path is set incorrectly.

    • +
    • Solution: export CCL_ROOT=${CONDA_PREFIX}

    • +
    +
  • +
  • Problem: Issues related to MPI environment variable configuration when running distributed tasks.

    +
      +
    • Cause: The MPI environment variable configuration is not correct.

    • +
    • Solution: Run conda deactivate and then conda activate to restore the correct MPI environment variables automatically.

      +
      conda deactivate
      +conda activate
      +export OCL_ICD_VENDORS=/etc/OpenCL/vendors
      +
      +
      +
    • +
    +
  • +
+
+
+

Performance Issue

+
    +
  • Problem: Extended durations for data transfers from the host system to the device (H2D) and from the device back to the host system (D2H).

    +
      +
    • Cause: Absence of certain Dynamic Kernel Module Support (DKMS) packages on Ubuntu 22.04 or earlier versions.

    • +
    • Solution: For those running Ubuntu 22.04 or below, it’s crucial to follow all the recommended installation procedures, including those labeled as optional. These steps are likely necessary to install the missing DKMS packages and ensure your system is functioning optimally. The Kernel Mode Driver (KMD) package that addresses this issue has been integrated into the Linux kernel for Ubuntu 23.04 and subsequent releases.

    • +
    +
  • +
+
+
+

Unit Test

+
    +
  • Unit test failures on Intel® Data Center GPU Flex Series 170

    +

    The following unit test fails on Intel® Data Center GPU Flex Series 170 but the same test case passes on Intel® Data Center GPU Max Series. The root cause of the failure is under investigation.

    +
      +
    • test_weight_norm.py::TestNNMethod::test_weight_norm_differnt_type

    • +
    +
  • +
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/license.html b/xpu/2.3.110+xpu/tutorials/license.html new file mode 100644 index 000000000..9bd0b57f7 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/license.html @@ -0,0 +1,147 @@ + + + + + + + License — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

License

+

Intel® Extension for PyTorch* is licensed under Apache License Version 2.0. This software includes components that have separate copyright notices and licensing terms. Your use of the source code for these components is subject to the terms and conditions of the following licenses.

+

Apache License Version 2.0:

+

Intel® Extension for PyTorch* LICENSE

+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/llm.html b/xpu/2.3.110+xpu/tutorials/llm.html new file mode 100644 index 000000000..7773120de --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/llm.html @@ -0,0 +1,361 @@ + + + + + + + Large Language Models (LLM) Optimizations Overview — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Large Language Models (LLM) Optimizations Overview

+

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity, and LLMs have emerged as the dominant models driving these GenAI applications. Most LLMs are GPT-like architectures that consist of multiple Decoder layers, in which the MultiHeadAttention and FeedForward layers are two key components. The generation task is memory bound because iterative decoding and the kv_cache require special management to reduce memory overheads. Intel® Extension for PyTorch* provides many optimizations specific to these LLMs. On the operator level, the extension provides highly efficient GEMM kernels to speed up Linear layers and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant, are enabled. Tensor parallelism can also be adopted to get lower latency for LLMs.

+

These LLM-specific optimizations can be applied automatically with a single frontend API function in the Python interface, ipex.llm.optimize(). Check ipex.llm.optimize for more details.
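A minimal sketch of this call (the model ID is one of the validated models listed below and is used here only as a placeholder):

import torch
import transformers
import intel_extension_for_pytorch as ipex

model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

# One call applies the LLM-specific operator, memory, and fusion optimizations.
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

# model.generate(...) now runs with the optimized modules.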

+
+
+
+

Validated Models List

+
+

LLM Inference

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Model Family

Verified < MODEL ID > (Huggingface hub)

FP16

INT4 WOQ

Llama2

“meta-llama/Llama-2-7b-hf”, “meta-llama/Llama-2-13b-hf”, “meta-llama/Llama-2-70b-hf”

Llama3

“meta-llama/Meta-Llama-3-8B”

Phi-3 mini

“microsoft/Phi-3-mini-128k-instruct”

GPT-J

“EleutherAI/gpt-j-6b”

Qwen

“Qwen/Qwen-7B”

OPT

“facebook/opt-30b”, “facebook/opt-1.3b”

Bloom

“bigscience/bloom-7b1”, “bigscience/bloom”

ChatGLM3-6B

“THUDM/chatglm3-6b”

Baichuan2-13B

“baichuan-inc/Baichuan2-13B-Chat”

+

Note: The above verified models (including other models in the same model family, like “codellama/CodeLlama-7b-hf” from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and prepacked TPP Linear (fp16). For other LLM families, work is in progress to cover those optimizations, which will expand the model list above.

+
+
+

LLM fine-tuning on Intel® Data Center Max 1550 GPU

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Model Family

Verified < MODEL ID > (Huggingface hub)

Mixed Precision (BF16+FP32)

Full fine-tuning

LoRA

Llama2

“meta-llama/Llama-2-7b-hf”

Llama2

“meta-llama/Llama-2-70b-hf”,

Llama3

“meta-llama/Meta-Llama-3-8B”

Qwen

“Qwen/Qwen-7B”

Phi-3-mini 3.8B

“Phi-3-mini-4k-instruct”

+
+
+

LLM fine-tuning on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics

+ + + + + + + + + + + + + + + + + +

Model Family

Verified < MODEL ID > (Huggingface hub)

Mixed Precision (BF16+FP32)

Full fine-tuning

LoRA

Phi-3-mini 3.8B

“Phi-3-mini-4k-instruct”

+

Check the LLM best known practice for instructions to install and set up the environment, and for example scripts.

+
+
+
+

Optimization Methodologies

+

A brief introduction to these optimizations follows:

+
+

Linear Operator Optimization

+

LLM inference is a task bound by Linear weight memory bandwidth. Intel® Extension for PyTorch* provides three backends to speed up linear GEMM kernels: Intel® oneDNN, Intel® Xe Templates for Linear Algebra (XeTLA), and customized linear kernels for weight-only quantization.
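A back-of-the-envelope sketch of why a single-token decode is weight-memory bound (the layer size is an arbitrary illustrative choice):

# Arithmetic intensity of one Linear layer during batch-1 decoding,
# using an illustrative 4096x4096 FP16 weight.
in_features, out_features, bytes_per_elem = 4096, 4096, 2

weight_bytes = in_features * out_features * bytes_per_elem  # ~33.5 MB read per token
flops = 2 * in_features * out_features                      # one GEMV per token
print(flops / weight_bytes)                                  # ~1 FLOP per byte -> memory bound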

+
+
+

Deep Fusion Policy

+

Operator fusion is a general approach to reduce memory access and kernel launch overhead. Beyond linear post-op fusion, e.g., linear + activation function, many customized operators are also provided in Intel® Extension for PyTorch* for further performance improvement, for example Rotary Position Embedding (RoPE) and Root Mean Square Layer Normalization (RMSNorm).
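For reference, a plain (unfused) definition of RMSNorm is sketched below; the extension provides a fused, customized kernel for this computation:

import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Reference (unfused) Root Mean Square Layer Normalization."""
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

x = torch.randn(2, 8, 4096)
w = torch.ones(4096)
y = rms_norm(x, w)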

+
+
+

Segment KV Cache

+

KV Cache is used to reduce computation for the decoder layers, but it also brings memory overheads. For example, when beam search is used, the KV Cache must be reordered according to the latest beam index, and the current key/value must be concatenated with the KV Cache in the attention layer to obtain the entire context for scaled dot-product attention. When the sequence is very long, the memory overheads caused by reorder_cache and concatenation become the performance bottleneck. Moreover, in the standard implementation, prompt and response key/value are kept in contiguous KV Cache buffers for attention context computation, which wastes memory by replicating the prompt key/value Beam Width times. Segment KV Cache is provided to reduce these overheads. First, the prompt key/value is computed in the prefill phase and kept on the device during the decoding phase; its shape is not influenced by the Beam Width value. In the decoding phase, we pre-allocate buffers (key and value use different buffers) to store the response key/value hidden states and beam index information, then use the beam index history, shown in the left figure below, to decide which beam should be used at each timestamp. This information generates an offset into the KV Cache buffer, so the reorder_cache and concatenation overheads are eliminated. The SDPA kernel based on the Segment KV Cache policy is shown in the right figure below.

+Figure (left): the beam idx trace for every step. Figure (right): the SDPA kernel based on the Segment KV Cache policy. +
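To make the offset idea concrete, here is a toy sketch of reading one beam's past keys through a beam index trace (the names, shapes, and layout are hypothetical and do not reflect the extension's internal implementation):

import torch

# Hypothetical pre-allocated response key buffer: [max_steps, num_beams, num_heads, head_dim]
max_steps, num_beams, num_heads, head_dim = 128, 4, 16, 64
key_cache = torch.zeros(max_steps, num_beams, num_heads, head_dim)
# beam_idx_history[t, b]: which cache slot beam b descended from at step t
beam_idx_history = torch.zeros(max_steps, num_beams, dtype=torch.long)

def beam_offsets(step: int, beam: int) -> torch.Tensor:
    """Walk the beam index trace backwards to find, for each past timestep,
    the cache slot holding this beam's key (no reorder_cache, no concat)."""
    offsets = torch.empty(step, dtype=torch.long)
    slot = beam
    for t in range(step - 1, -1, -1):
        offsets[t] = slot
        slot = int(beam_idx_history[t, slot])
    return offsets

def gather_keys(step: int, beam: int) -> torch.Tensor:
    offs = beam_offsets(step, beam)
    return key_cache[torch.arange(step), offs]  # [step, num_heads, head_dim]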
+
+

Distributed Inference

+

All of the above optimizations already deliver very good performance on a single GPU card/tile. To further reduce inference latency and improve throughput, tensor parallelism is also enabled in our solution. You can first use DeepSpeed to automatically shard the model and then apply the above optimizations with the frontend API function provided by Intel® Extension for PyTorch*.

+
+
+

Low Precision Data Types

+

While Generative AI (GenAI) workloads and models are becoming more and more popular, the large language models (LLMs) used in these workloads have more and more parameters. The increasing size of LLMs improves workload accuracy; however, it also leads to significantly heavier computation and places higher requirements on the underlying hardware. Given that, quantization becomes an increasingly important methodology for inference workloads.

+
+
+
+

Weight Only Quantization INT4

+

Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.

+

However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.

+

To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike normal quantization, such as w8a8, which quantizes both weights and activations, we focus on Weight-Only Quantization (WOQ), which statically quantizes only the weights. WOQ offers a better trade-off between efficiency and accuracy, as the main bottleneck in deploying LLMs is memory bandwidth and WOQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.

+

For more detailed information, check WOQ INT4.

+
+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/llm/int4_weight_only_quantization.html b/xpu/2.3.110+xpu/tutorials/llm/int4_weight_only_quantization.html new file mode 100644 index 000000000..20998aafb --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/llm/int4_weight_only_quantization.html @@ -0,0 +1,388 @@ + + + + + + + Weight-Only Quantization (Prototype) — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Weight-Only Quantization (Prototype)

+
+

Introduction

+

Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks.

+

However, deploying them on devices with limited resources is challenging due to their high computational and memory requirements.

+

To overcome this issue, we propose quantization methods that reduce the size and complexity of LLMs. Unlike normal quantization, such as w8a8, which quantizes both weights and activations, we focus on Weight-Only Quantization (WOQ), which statically quantizes only the weights. WOQ offers a better trade-off between efficiency and accuracy, as the main bottleneck in deploying LLMs is memory bandwidth and WOQ usually preserves more accuracy. Experiments on Qwen-7B, a large-scale LLM, show that we can obtain accurate quantized models with minimal loss of quality.

+
+
+

Supported Framework Model Matrix

+ + + + + + + + + + + + + + + + + + + + + + + +
Support DeviceRTN*AWQ*TEQ*GPTQ*AutoRound*Data type of quantized weight
GPUstay tuned*stay tuned*stay tuned*stay tuned*int4_fullrange
+ + + + + + + + + + + + + + + + + + + + + + + + + +
ModelDatatypePlatformDeviceAlgorithm
Qwen-7BINT4Intel® Data Center GPU Max Series and Intel® Arc™ A-Series GraphicsIntel® GPURTN
GPT-J-6BINT4Intel® Data Center GPU Max Series and Intel® Arc™ A-Series GraphicsIntel® GPURTN
+

Note: The RTN algorithm is supported by Intel® Extension for PyTorch*. Other algorithms are marked as ‘stay tuned’, and we highly recommend waiting for the availability of the INT4 models on the HuggingFace Model Hub, since the LLM quantization procedure is significantly constrained by the machine’s host memory and computation capabilities.

+
+

RTN[1]: Rounding to Nearest (RTN) is an intuitively simple method that rounds values to the nearest integer. It boasts simplicity, requires no additional datasets, and offers fast quantization. Besides, it can easily be applied to other data types like NF4 (non-uniform). Typically, it performs well on configurations such as W4G32 or W8, but worse than advanced algorithms at lower precision levels.
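For intuition, a toy sketch of symmetric per-group RTN quantization to INT4 (illustrative only; the helper names are made up and this is not the extension's kernel):

import torch

def rtn_int4_quantize(w: torch.Tensor, group_size: int = 32):
    """Per-group symmetric RTN: scale = max|w| / 7, q = round(w / scale), clipped to [-8, 7]."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def rtn_int4_dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 64)
q, s = rtn_int4_quantize(w, group_size=32)          # e.g. a W4G32 configuration
print((w - rtn_int4_dequantize(q, s)).abs().max())  # the rounding error RTN introduces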

+

AWQ[2]: AWQ is a popular method that explores weight min-max values and equivalent transformations in a handcrafted space. While effective, the equivalent transformation imposes certain requirements on model architecture, limiting its applicability to broader models or increasing engineering efforts.

+

TEQ[3]: To our knowledge, it is the first trainable equivalent transformation method (submitted for peer review in 2023-06). However, it requires more memory than other methods, as a model-wise loss is used, and the equivalent transformation imposes certain requirements on the model architecture.

+

GPTQ[4]: GPTQ is a widely adopted method based on the Optimal Brain Surgeon. It quantizes weights block by block and fine-tunes the remaining unquantized ones to mitigate quantization errors. Occasionally, non-positive semidefinite matrices may occur, necessitating adjustments to hyperparameters.

+

AutoRound[5]: AutoRound utilizes sign gradient descent to optimize the rounding values and min-max values of weights within just 200 steps, showcasing impressive performance compared to recent methods like GPTQ/AWQ. Additionally, it offers hyperparameter tuning compatibility to further enhance performance. However, due to its reliance on gradient backpropagation, it is currently not a good fit for backends like ONNX.

+
+

References

+

[1] +Gunho Park, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. +nuqmm: Quantized matmul for efficient inference of large-scale generative language models. +arXiv preprint arXiv:2206.09557, 2022.

+

[2] +Lin, Ji, et al.(2023). +AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. +arXiv preprint arXiv:2306.00978.

+

[3] +Cheng, W., Cai, Y., Lv, K & Shen, H. (2023). +TEQ: Trainable Equivalent Transformation for Quantization of LLMs. +arXiv preprint arXiv:2310.10944.

+

[4] +Frantar, Elias, et al. “Gptq: Accurate post-training quantization for generative pre-trained transformers.” arXiv preprint arXiv:2210.17323 (2022).

+

[5] +Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X., & Lv, K. (2023). +Optimize weight rounding via signed gradient descent for the quantization of llms. +arXiv preprint arXiv:2309.05516.

+
+
+
+

Weight-Only Quantization LLM features in Intel® Extension for PyTorch*

+

In this section, we describe the implementation of Weight-Only Quantization LLM features in Intel® Extension for PyTorch*. These operators are highly optimized on the Intel® GPU platform. +image

+
+

Weight-Only Quantization Initialization

+

On Intel® GPU, the easiest way to load INT4 models is to use the load_in_4bit interface provided by Intel® Extension for Transformers*, which hooks the AutoModelForCausalLM.from_pretrained function to use load_in_4bit on Intel® GPU. Pass the argument load_in_4bit=True to load a model in 4bit when calling the from_pretrained method, which can read the model weight in INT4 format directly.

+
qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True, use_llm_runtime=False)
+
+
+

Another option that Intel® Extension for Transformers* offers is to extend the AutoModelForCausalLM.from_pretrained function to allow quantization_config to take WeightOnlyQuantConfig as an argument, which enables conversion on the Intel® GPU platform. We currently support the RTN algorithm and the weight_dtype setting of int4_fullrange (which means that all linear weights are converted to INT4).

+
woq_quantization_config = WeightOnlyQuantConfig(compute_dtype="fp16", weight_dtype="int4_fullrange", scale_dtype="fp16", group_size=64)
+qmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", quantization_config=woq_quantization_config, trust_remote_code=True)
+
+
+

In the Weight-Only Quantization INT4 case, when AutoModelForCausalLM.from_pretrained from Intel® Extension for Transformers* is used to load the model, Intel® Neural Compressor is used, according to the running device, to perform the quantization deployment.

+
    inc_model = quantization.fit(model,
+                                    conf,
+                                    calib_func=calib_func,
+                                    calib_dataloader=calib_dataloader)
+    model = inc_model.export_compressed_model(compression_dtype=torch.int8,
+                                                compression_dim=0,
+                                                use_optimum_format=False,
+                                                scale_dtype=convert_dtype_str2torch("fp16"))
+
+
+

When running on Intel® GPU, it will replace the linear modules in the model with WeightOnlyQuantizedLinear. After that, the model linear weight loaded by ipex.llm.optimize is in INT4 format, and it contains not only weight and bias information but also scales, zero_points, and blocksize information. When optimizing transformers at the front end, Intel® Extension for PyTorch* will use WeightOnlyQuantizedLinear to initialize this information in the model if it is present; otherwise, it will use IPEXTransformerLinear to initialize the linear parameters in the model.
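A quick, illustrative way to see which linear implementation ended up in the model after the WOQ flow above (assuming qmodel from the earlier snippets; the printed class names depend on the installed packages):

# List linear-like modules after ipex.llm.optimize; in the INT4 path they are
# expected to show up as WeightOnlyQuantizedLinear.
for name, module in qmodel.named_modules():
    if "linear" in type(module).__name__.lower():
        print(name, "->", type(module).__name__)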

+
+
+

Weight-Only Quantization Runtime

+

On Intel® GPU, after using ipex.llm.optimize, Intel® Extension for PyTorch* will automatically replace the original attention module with IPEXTransformerAttnOptimizedInt4 and the original mlp module with IPEXTransformerMLPOptimizedInt4 in the model.

+

The major changes between IPEXTransformerAttnOptimizedInt4 for the INT4 scenario and ipex.llm.optimize for the FP16 scenario are: the linear used to calculate qkv is replaced with torch.ops.torch_ipex.mm_qkv_out_int4, and out_linear is replaced with torch.ops.torch_ipex.mm_bias_int4.

+

The major changes between IPEXTransformerMLPOptimizedInt4 for the INT4 scenario and ipex.llm.optimize for the FP16 scenario are: the linear used in the mlp is replaced with torch.ops.torch_ipex.mm_bias_int4; if an activation is used in the mlp module, it is correspondingly replaced with our fused linear+activation kernel, such as torch.ops.torch_ipex.mm_silu_mul_int4.

+
+
+

Weight-Only Quantization Linear Dispatch

+

As explained before, after applying ipex.llm.optimize, the linear kernel that Intel® Extension for PyTorch* has registered to substitute the original linear will be used in the model.

+

The method is:

+

Firstly, a new operator in Intel® Extension for PyTorch* will be registered through IPEX_OP_REGISTER("mm_bias_int4.xpu", at::AtenIpexTypeXPU::mm_bias_int4) and the operator name will be mm_bias_int4.

+

Then HGEMMXetla_INT4 will be used to register the corresponding policy for mm_bias_int4 beforehand. Later, we use policy.run() to make the configured policy take effect.

+

During execution, Intel® Extension for PyTorch* will determine the current running platform according to the machine configuration Settings::I().has_2d_block_array(curDevID) and look for a suitable policy for it. If it is Intel® Data Center GPU Max Series platform, it will use the policy implemented in ORDERED_GEMM_WINT4_CONFIG_SET_PVC. If it is Intel® Arc™ A-Series Graphics platform, it will use the policy implemented in ORDERED_GEMM_WINT4_CONFIG_SET_ARC.

+

After the policy is selected, Intel® Extension for PyTorch* will use HGEMM_INT4_COMMON_DISPATCH to dispatch the operator to different kernels based on different linear configuration parameters and platforms. For example, mm_bias_int4 on the Intel® Arc™ A-Series Graphics platform will be dispatched to the hgemm_bias_wint4_arc kernel.

+
+
+
+

Usage of running Weight-Only Quantization LLM For Intel® GPU

+

Intel® Extension for PyTorch* implements Weight-Only Quantization for Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics with Intel® Extension for Transformers*. The section below uses Qwen-7B to demonstrate the detailed usage.

+
+

Environment Setup

+

Please refer to the instructions.

+
+
+

Run Weight-Only Quantization LLM on Intel® GPU

+
+

Install Intel-extension-for-transformers and Neural-compressor

+
pip install neural-compressor
+pip install intel-extension-for-transformers
+
+
+
+
+

Quantize Model and Inference

+
import torch
+import intel_extension_for_pytorch as ipex
+from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
+from transformers import AutoTokenizer
+
+device = "xpu"
+model_name = "Qwen/Qwen-7B"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+prompt = "Once upon a time, there existed a little girl,"
+inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+
+# woq_quantization_config = WeightOnlyQuantConfig(compute_dtype="fp16", weight_dtype="int4_fullrange", scale_dtype="fp16", group_size=64)
+# qmodel = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", quantization_config=woq_quantization_config, trust_remote_code=True)
+
+qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True)
+
+# optimize the model with Intel® Extension for PyTorch*, it will improve performance.
+qmodel = ipex.llm.optimize(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")
+
+output = qmodel.generate(inputs)
+
+
+
+

Note: On a GPU device without sufficient device memory, it is recommended to quantize and save the model first, and then load it as shown below. Otherwise, you can skip the instructions below and execute quantization and inference directly on your device.

+
+
+
+

Save and Load Quantized Model (Optional)

+

+from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
+
+qmodel = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_4bit=True, device_map="xpu", trust_remote_code=True)
+
+# Please note, saving the model should be done before the ipex.llm.optimize function is called.
+qmodel.save_pretrained("saved_dir")
+
+# Load model
+loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True)
+
+# Before executing the loaded model, you can call the ipex.llm.optimize function.
+loaded_model = ipex.llm.optimize(loaded_model, inplace=True, dtype=torch.float16, woq=True, device="xpu")
+
+output = loaded_model.generate(inputs)
+
+
+
+
+

Execute WOQ benchmark script

+
bash run_benchmark_woq.sh
+
+
+
+

Note:

+
    +
  • Be sure to save the quantized model before calling the ipex.llm.optimize function.

  • +
  • The ipex.llm.optimize function is designed to optimize transformer-based models within frontend python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. Please refer to Transformers Optimization Frontend API for the detail of ipex.llm.optimize.

  • +
+
+
+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/llm/llm_optimize_transformers.html b/xpu/2.3.110+xpu/tutorials/llm/llm_optimize_transformers.html new file mode 100644 index 000000000..ca8133cfd --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/llm/llm_optimize_transformers.html @@ -0,0 +1,276 @@ + + + + + + + Transformers Optimization Frontend API — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Transformers Optimization Frontend API

+

The new API function, ipex.llm.optimize, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. You just need to invoke the ipex.llm.optimize function instead of the ipex.optimize function to apply all optimizations transparently.

+

This API currently works for inference workloads. Support for training is under development. Currently, this API supports certain models; the supported model list can be found at Overview.

+

API documentation is available at API Docs page.

+
+

Pseudocode of Common Usage Scenarios

+

The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch* APIs to work with LLMs. Complete examples can be found at the Example directory.

+
+

FP16

+
import torch
+import intel_extension_for_pytorch as ipex
+import transformers
+
+
+device = "xpu"
+model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval().to(device)
+
+amp_dtype = torch.float16 
+model = ipex.llm.optimize(model.eval(), dtype=amp_dtype, device=device, inplace=True)
+
+# inference with model.generate()
+...
+
+
+
+
+

SmoothQuant

+

Supports INT8.

+
+

Imperative mode

+
import torch
+import intel_extension_for_pytorch
+
+# Define model
+model = Model().to("xpu")
+model.eval()
+modelImpe = torch.quantization.QuantWrapper(model)
+
+# Define QConfig
qconfig = torch.quantization.QConfig(activation=torch.quantization.observer.MinMaxObserver.with_args(qscheme=torch.per_tensor_symmetric),
+    weight=torch.quantization.default_weight_observer)  # weight could also be per-channel
+
+modelImpe.qconfig = qconfig
+
+# Prepare model for inserting observer
+torch.quantization.prepare(modelImpe, inplace=True)
+
+# Calibration to obtain statistics for Observer
+for data in calib_dataset:
+    modelImpe(data)
+
+# Convert model to create a quantized module
+torch.quantization.convert(modelImpe, inplace=True)
+
+# Inference
+modelImpe(inference_data)
+
+
+
+
+

TorchScript Mode

+

+import torch
+import intel_extension_for_pytorch
+from torch.quantization.quantize_jit import (
+    convert_jit,
+    prepare_jit,
+)
+
+# Define model
+model = Model().to("xpu")
+model.eval()
+
+# Generate a ScriptModule
+modelJit = torch.jit.trace(model, example_input) # or torch.jit.script(model)
+
+# Define QConfig
+qconfig = torch.quantization.QConfig(
+    activation=torch.quantization.observer.MinMaxObserver.with_args(
+        qscheme=qscheme,
+        reduce_range=False,
+        dtype=dtype
+    ),
+    weight=torch.quantization.default_weight_observer
+)
+
+# Prepare model for inserting observer
+modelJit = prepare_jit(modelJit, {'': qconfig}, inplace=True)
+
+# Calibration 
+for data in calib_dataset:
+    modelJit(data)
+
+# Convert model to quantized one
+modelJit = convert_jit(modelJit)
+
+# Warmup to fully trigger fusion patterns
+for i in range(5):
+    modelJit(warmup_data) 
+# Inference
+modelJit(inference_data)
+
+# Debug
+print(modelJit.graph_for(inference_data))
+
+
+
+
+
+

Distributed Inference with DeepSpeed

+

Distributed inference can be performed with DeepSpeed. Based on the original Intel® Extension for PyTorch* scripts, the following code changes are required.

+

Check the Distributed Examples in the LLM example for the complete code; a rough sketch of the pattern is shown below.
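The sketch below assumes a multi-process launcher sets LOCAL_RANK/WORLD_SIZE and that Intel® Extension for DeepSpeed* is installed; model_name_or_path is a placeholder as in the FP16 pseudocode above, and the deepspeed.init_inference arguments follow common DeepSpeed usage and may differ between DeepSpeed versions:

import os
import torch
import deepspeed
import transformers
import intel_extension_for_pytorch as ipex

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

model = transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()

# Let DeepSpeed shard the model across ranks (tensor parallel) ...
model = deepspeed.init_inference(model, mp_size=world_size, dtype=torch.float16)

# ... then apply the usual Intel® Extension for PyTorch* LLM optimizations.
model = ipex.llm.optimize(model.module, dtype=torch.float16, device="xpu", inplace=True)

# inference with model.generate()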

+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/performance.html b/xpu/2.3.110+xpu/tutorials/performance.html new file mode 100644 index 000000000..e0cff9ca6 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/performance.html @@ -0,0 +1,318 @@ + + + + + + + Performance — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Performance

+
+

Overview

+

This page shows performance boost with Intel® Extension for PyTorch* on several popular topologies.

+
+
+

Performance Data for Intel® AI Data Center Products

+

Find the latest performance data for Intel® Data Center Max 1550 GPU, including detailed hardware and software configurations.

+
+
+

LLM Performance v2.1.10

+

We benchmarked GPT-J 6B, LLaMA2 7B and 13B, OPT 6.7B, and Bloom-7B with the test input token length set to 1024. The data type is FP16 for all models.

+

Single Tile

+

Single Card

+

Two Card

+

Four Card

+
+

Configuration

+
+

Software Version

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
SoftwareVersion
PyTorchv2.1
Intel® Extension for PyTorch*v2.1.10+xpu
Intel® oneAPI Base Toolkit2024.0
Torch-CCL2.1.100
GPU Driver736.25
Transformersv4.31.0
DeepSpeedcommit 4fc181b0
Intel® Extension for DeepSpeed*commit ec33277
+
+

Hardware Configuration

+

CPU Configuration:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
CPUIntel(R) Xeon(R) Platinum 8480+ CPU
Number of nodes1
Number of sockets2
Cores/Socket56
Threads/Core2
uCode0x2b0004b1
Hyper-ThreadingON
TurboBoostON
BIOS versionSE5C7411.86B.9525.D25.2304190630
Number of DDR Memory slots16
Capacity of DDR memory per slot64GB
DDR frequency4800
Total Memory/Node (DDR+DCPMM)1024GB
Host OSUbuntu 22.04.3 LTS
Host Kernel5.17.0-1020-oem
Spectre-Meltdown MitigationMitigated

Single tile of 4X PVC OAM Configuration:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
GPUIntel(R) Data Center Max 1550 GPU
IFWIPVC.PS.B4.P.Si.2023.WW42.3_25MHzi_Quad_DAMeni_OAM600W_IFRv2332i_PSCnull_IFWI.bin
ECCON
AMC SWAMC FW 6.2
PrecisionFP16
+
+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/releases.html b/xpu/2.3.110+xpu/tutorials/releases.html new file mode 100644 index 000000000..b7bc6d2b1 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/releases.html @@ -0,0 +1,576 @@ + + + + + + + Releases — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Releases

+
+

2.3.110+xpu

+

Intel® Extension for PyTorch* v2.3.110+xpu is the new release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch* 2.3.1.

+
+

Highlights

+
    +
  • Intel® oneDNN v3.5.3 integration

  • +
  • Intel® oneAPI Base Toolkit 2024.2.1 compatibility

  • +
  • Large Language Model (LLM) optimization

    +

    Intel® Extension for PyTorch* provides a new dedicated module, ipex.llm, to host APIs specific to Large Language Models (LLMs). With ipex.llm, Intel® Extension for PyTorch* provides comprehensive LLM optimization for the FP16 and INT4 datatypes. Specifically for low precision, Weight-Only Quantization is supported for various scenarios. Users can also run Intel® Extension for PyTorch* with Tensor Parallel to fit multi-rank or multi-node scenarios and get even better performance.

    +

    A typical API under this new module is ipex.llm.optimize, which is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. ipex.llm.optimize is an upgrade API to replace previous ipex.optimize_transformers, which will bring you more consistent LLM experience and performance. Below shows a simple example of ipex.llm.optimize for fp16 inference:

    +
      import torch
    +  import intel_extension_for_pytorch as ipex
    +  import transformers
    +
    +  model= transformers.AutoModelForCausalLM.from_pretrained(model_name_or_path).eval()
    +
    +  dtype = torch.float16
    +  model = ipex.llm.optimize(model, dtype=dtype, device="xpu")
    +
    +  model.generate(YOUR_GENERATION_PARAMS)
    +
    +
    +

    More examples of this API can be found at LLM optimization API.

    +

    Besides that, we optimized more LLM inference models. A full list of optimized models can be found at LLM Optimizations Overview.

    +
  • +
  • Serving framework support

    +

    Typical LLM serving frameworks including vLLM and TGI can co-work with Intel® Extension for PyTorch* on Intel® GPU platforms (Intel® Data Center GPU Max 1550 and Intel® Arc™ A-Series Graphics). Besides the integration of LLM serving frameworks with ipex.llm module-level APIs, we enhanced the performance and quality of the underlying Intel® Extension for PyTorch* operators, such as paged attention and flash attention, for better end-to-end model performance.

    +
  • +
  • Prototype support of full fine-tuning and LoRA PEFT with mixed precision

    +

    Intel® Extension for PyTorch* also provides new capabilities to support popular recipes with both full fine-tuning and LoRA PEFT in mixed precision with BF16 and FP32. We optimized many typical LLM models, including the Llama 2 (7B and 70B), Llama 3 8B, and Phi-3-Mini 3.8B model families and the Chinese model Qwen-7B, for both single-GPU and multi-GPU (distributed fine-tuning based on PyTorch FSDP) use cases.

    +
  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

2.1.40+xpu

+

Intel® Extension for PyTorch* v2.1.40+xpu is a minor release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series, Intel® Arc™ A-Series Graphics and Intel® Core™ Ultra Processors with Intel® Arc™ Graphics) based on PyTorch* 2.1.0.

+
+

Highlights

+
    +
  • Intel® oneAPI Base Toolkit 2024.2.1 compatibility

  • +
  • Intel® oneDNN v3.5 integration

  • +
  • Intel® oneCCL 2021.13.1 integration

  • +
  • Intel® Core™ Ultra Processors with Intel® Arc™ Graphics (MTL-H) support on Windows (Prototype)

  • +
  • Bug fixing and other optimization

    +
      +
    • Fix host memory leak #4280

    • +
    • Fix LayerNorm issue for undefined grad_input #4317

    • +
    • Replace FP64 device check method #4354

    • +
    • Fix online doc search issue #4358

    • +
    • Fix pdist unit test failure on client GPUs #4361

    • +
    • Remove primitive cache from conv fwd #4429

    • +
    • Fix sdp bwd page fault with no grad bias #4439

    • +
    • Fix implicit data conversion #4463

    • +
    • Fix compiler version parsing issue #4468

    • +
    • Fix irfft invalid descriptor #4480

    • +
    • Change condition order to fix out-of-bound access in index #4495

    • +
    • Add parameter check in embedding bag #4504

    • +
    • Add the backward implementation for rms norm #4527

    • +
    • Fix attn_mask for sdpa beam_search #4557

    • +
    • Use data_ptr template instead of force data conversion #4558

    • +
    • Workaround windows AOT image size over 2GB issue on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics #4407 #4450

    • +
    +
  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

2.1.30+xpu

+

Intel® Extension for PyTorch* v2.1.30+xpu is an update release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch* 2.1.0.

+
+

Highlights

+
    +
  • Intel® oneDNN v3.4.1 integration

  • +
  • Intel® oneAPI Base Toolkit 2024.1 compatibility

  • +
  • Large Language Model (LLM) optimizations for FP16 inference on Intel® Data Center GPU Max Series (Beta): Intel® Extension for PyTorch* provides a lot of specific optimizations for LLM workloads in this release on Intel® Data Center GPU Max Series. In operator level, we provide highly efficient GEMM kernel to speed up Linear layer and customized fused operators to reduce HBM access/kernel launch overhead. To reduce memory footprint, we define a segment KV Cache policy to save device memory and improve the throughput. Such optimizations are added in this release to enhance existing optimized LLM FP16 models and more Chinese LLM models such as Baichuan2-13B, ChatGLM3-6B and Qwen-7B.

  • +
  • LLM optimizations for INT4 inference on Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics (Prototype): Intel® Extension for PyTorch* shows remarkable performance when executing LLM models on Intel® GPU. However, deploying such models on GPUs with limited resources is challenging due to their high computational and memory requirements. To achieve a better trade-off, a low-precision solution, e.g., weight-only quantization for INT4, is enabled to allow Llama 2-7B, GPT-J-6B and Qwen-7B to be executed efficiently on Intel® Arc™ A-Series Graphics. The same optimization lets INT4 models achieve a 1.5x speedup in total latency compared with FP16 models with the same configuration and parameters on Intel® Data Center GPU Max Series.

  • +
  • Opt-in collective performance optimization with oneCCL Bindings for Pytorch*: This opt-in feature can be enabled by setting TORCH_LLM_ALLREDUCE=1 to provide better scale-up performance by enabling optimized collectives such as allreduce, allgather, reducescatter algorithms in Intel® oneCCL. This feature requires XeLink enabled for cross-cards communication.

  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

2.1.20+xpu

+

Intel® Extension for PyTorch* v2.1.20+xpu is a minor release which supports Intel® GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch* 2.1.0.

+
+

Highlights

+
    +
  • Intel® oneAPI Base Toolkit 2024.1 compatibility

  • +
  • Intel® oneDNN v3.4 integration

  • +
  • LLM inference scaling optimization based on Intel® oneCCL 2021.12 (Prototype)

  • +
  • Bug fixing and other optimization

    +
      +
    • Uplift XeTLA to v0.3.4.1 #3696

    • +
    • [SDP] Fallback unsupported bias size to native impl #3706

    • +
    • Error handling enhancement #3788, #3841

    • +
    • Fix beam search accuracy issue in workgroup reduce #3796

    • +
    • Support int32 index tensor in index operator #3808

    • +
    • Add deepspeed in LLM dockerfile #3829

    • +
    • Fix batch norm accuracy issue #3882

    • +
    • Prebuilt wheel dockerfile update #3887, #3970

    • +
    • Fix windows build failure with Intel® oneMKL 2024.1 in torch_patches #18

    • +
    • Fix FFT core dump issue with Intel® oneMKL 2024.1 in torch_patches #20, #21

    • +
    +
  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

2.1.10+xpu

+

Intel® Extension for PyTorch* v2.1.10+xpu is the new Intel® Extension for PyTorch* release supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series and Intel® Arc™ A-Series Graphics) based on PyTorch* 2.1.0. It extends PyTorch* 2.1.0 with up-to-date features and optimizations on xpu for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* xpu device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.

+
+

Highlights

+

This release provides the following features:

+
    +
  • Large Language Model (LLM) optimizations for FP16 inference on Intel® Data Center GPU Max Series (Prototype): Intel® Extension for PyTorch* provides a lot of specific optimizations for LLM workloads on Intel® Data Center GPU Max Series in this release. In operator level, we provide highly efficient GEMM kernel to speedup Linear layer and customized fused operators to reduce HBM access and kernel launch overhead. To reduce memory footprint, we define a segment KV Cache policy to save device memory and improve the throughput. To better trade-off the performance and accuracy, low-precision solution e.g., weight-only-quantization for INT4 is enabled. Besides, tensor parallel can also be adopted to get lower latency for LLMs.

    +
      +
    • A new API function, ipex.optimize_transformers, is designed to optimize transformer-based models within frontend Python modules, with a particular focus on LLMs. It provides optimizations for both model-wise and content-generation-wise. You just need to invoke the ipex.optimize_transformers API instead of the ipex.optimize API to apply all optimizations transparently. More detailed information can be found at Large Language Model optimizations overview.

    • +
    • A typical usage of this new feature is quite simple as below:

      +
      import torch
      +import intel_extension_for_pytorch as ipex
      +...
      +model = ipex.optimize_transformers(model, dtype=dtype)
      +
      +
      +
    • +
    +
  • +
  • Torch.compile functionality on Intel® Data Center GPU Max Series (Beta): Extends Intel® Extension for PyTorch* capabilities to support torch.compile APIs on Intel® Data Center GPU Max Series. And provides Intel GPU support on top of Triton* compiler to reach competitive performance speed-up over eager mode by default “inductor” backend of Intel® Extension for PyTorch*.

  • +
  • Intel® Arc™ A-Series Graphics on WSL2, native Windows and native Linux are officially supported in this release. Intel® Arc™ A770 Graphic card has been used as primary verification vehicle for product level test.

  • +
  • Other features are listed as following, more detailed information can be found in public documentation:

    +
      +
    • FP8 datatype support (Prototype): Add basic data type and FP8 Linear operator support based on emulation kernel.

    • +
    • Kineto Profiling (Prototype): An extension of PyTorch* profiler for profiling operators on Intel® GPU devices.

    • +
    • Fully Sharded Data Parallel (FSDP): Support new PyTorch* FSDP API which provides an industry-grade solution for large-scale model training.

    • +
    • Asymmetric INT8 quantization: Support asymmetric quantization to align with stock PyTorch* and provide better accuracy in INT8.

    • +
    +
  • +
  • CPU support has been merged in this release. CPU features and optimizations are equivalent to what has been released in Intel® Extension for PyTorch* v2.1.0+cpu release that was made publicly available in Oct 2023. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend them to use Intel® Extension for PyTorch* v2.1.0+cpu release for smaller footprint, less dependencies and broader OS support.

  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

2.0.110+xpu

+

Intel® Extension for PyTorch* v2.0.110+xpu is the new Intel® Extension for PyTorch* release supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch* 2.0.1. It extends PyTorch* 2.0.1 with up-to-date features and optimizations on xpu for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* xpu device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.

+
+

Highlights

+

This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch* dispatching mechanism for the xpu device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.

+

This release provides the following features:

+
    +
  • oneDNN 3.3 API integration and adoption

  • +
  • Libtorch support

  • +
  • ARC support on Windows, WSL2 and Ubuntu (Prototype)

  • +
  • OOB models improvement

    +
      +
    • More fusion patterns enabled for optimizing OOB models

    • +
    +
  • +
  • CPU support is merged in this release:

    +
      +
    • CPU features and optimizations are equivalent to what has been released in Intel® Extension for PyTorch* v2.0.100+cpu release that was made publicly available in May 2023. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend them to use Intel® Extension for PyTorch* v2.0.100+cpu release for smaller footprint, less dependencies and broader OS support.

    • +
    +
  • +
+

This release adds the following fusion patterns in PyTorch* JIT mode for Intel GPU:

+
    +
  • add + softmax

  • +
  • add + view + softmax

  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

1.13.120+xpu

+

Intel® Extension for PyTorch* v1.13.120+xpu is the updated Intel® Extension for PyTorch* release supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch* 1.13.1. It extends PyTorch* 1.13.1 with up-to-date features and optimizations on xpu for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through PyTorch* xpu device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.

+
+

Highlights

+

This release introduces specific XPU solution optimizations on Intel discrete GPUs which include Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through PyTorch* dispatching mechanism for the xpu device. These operators and kernels are accelerated on Intel GPU hardware from the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.

+

This release provides the following features:

+
    +
  • oneDNN 3.1 API integration and adoption

  • +
  • OOB models improvement

    +
      +
    • More fusion patterns enabled for optimizing OOB models

    • +
    +
  • +
  • CPU support is merged in this release:

    +
      +
    • CPU features and optimizations are equivalent to what has been released in Intel® Extension for PyTorch* v1.13.100+cpu release that was made publicly available in Feb 2023. For customers who would like to evaluate workloads on both GPU and CPU, they can use this package. For customers who are focusing on CPU only, we still recommend them to use Intel® Extension for PyTorch* v1.13.100+cpu release for smaller footprint, less dependencies and broader OS support.

    • +
    +
  • +
+

This release adds the following fusion patterns in PyTorch* JIT mode for Intel GPU:

+
    +
  • Matmul + UnaryOp(abs, sqrt, square, exp, log, round, Log_Sigmoid, Hardswish, HardSigmoid, Pow, ELU, SiLU, hardtanh, Leaky_relu)

  • +
  • Conv2d + BinaryOp(add, sub, mul, div, max, min, eq, ne, ge, gt, le, lt)

  • +
  • Linear + BinaryOp(add, sub, mul, div, max, min)

  • +
  • Conv2d + mul + add

  • +
  • Conv2d + mul + add + relu

  • +
  • Conv2d + sigmoid + mul + add

  • +
  • Conv2d + sigmoid + mul + add + relu

  • +
+
+
+

Known Issues

+

Please refer to Known Issues webpage.

+
+
+
+

1.13.10+xpu

+

Intel® Extension for PyTorch* v1.13.10+xpu is the first Intel® Extension for PyTorch* release that supports both CPU platforms and GPU platforms (Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series) based on PyTorch* 1.13. It extends PyTorch* 1.13 with up-to-date features and optimizations on xpu for an extra performance boost on Intel hardware. Optimizations take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, through the PyTorch* xpu device, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs with PyTorch*.

+
+

Highlights

+

This release introduces specific XPU solution optimizations on Intel discrete GPUs, including Intel® Data Center GPU Flex Series and Intel® Data Center GPU Max Series. Optimized operators and kernels are implemented and registered through the PyTorch* dispatching mechanism for the xpu device. These operators and kernels are accelerated on Intel GPU hardware by the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads and thus increase performance.

+

This release provides the following features:

+
    +
  • Distributed Training on GPU:

    +
      +
    • support of distributed training with DistributedDataParallel (DDP) on Intel GPU hardware

    • +
    • support of distributed training with Horovod (prototype feature) on Intel GPU hardware

    • +
    +
  • +
  • Automatic channels last format conversion on GPU:

    +
      +
    • Automatic channels last format conversion is enabled. Models using the torch.xpu.optimize API and running on Intel® Data Center GPU Max Series will be converted to the channels last memory format, while models running on Intel® Data Center GPU Flex Series will choose the oneDNN block format.

    • +
    +
  • +
  • CPU support is merged in this release:

    +
      +
    • CPU features and optimizations are equivalent to those in the Intel® Extension for PyTorch* v1.13.0+cpu release that was made publicly available in Nov 2022. Customers who would like to evaluate workloads on both GPU and CPU can use this package. For customers focusing on CPU only, we still recommend the Intel® Extension for PyTorch* v1.13.0+cpu release for a smaller footprint, fewer dependencies, and broader OS support.

    • +
    +
  • +
+

This release adds the following fusion patterns in PyTorch* JIT mode for Intel GPU:

+
    +
  • Conv2D + UnaryOp(abs, sqrt, square, exp, log, round, GeLU, Log_Sigmoid, Hardswish, Mish, HardSigmoid, Tanh, Pow, ELU, hardtanh)

  • +
  • Linear + UnaryOp(abs, sqrt, square, exp, log, round, Log_Sigmoid, Hardswish, HardSigmoid, Pow, ELU, SiLU, hardtanh, Leaky_relu)

  • +
+
+
+

Known Issues

+

Please refer to the Known Issues webpage.

+
+
+
+

1.10.200+gpu

+

Intel® Extension for PyTorch* v1.10.200+gpu extends PyTorch* 1.10 with up-to-date features and optimizations on XPU for an extra performance boost on Intel Graphics cards. XPU is a user-visible device that is a counterpart of the well-known CPU and CUDA devices in the PyTorch* community. XPU represents Intel-specific kernel and graph optimizations for various “concrete” devices. The XPU runtime chooses the actual device when executing AI workloads on the XPU device; the default selected device is the Intel GPU. XPU kernels from Intel® Extension for PyTorch* are written in DPC++, which supports the SYCL language as well as a number of DPC++ extensions.

+
+

Highlights

+

This release introduces specific XPU solution optimizations on Intel® Data Center GPU Flex Series 170. Optimized operators and kernels are implemented and registered through the PyTorch* dispatching mechanism for the XPU device. These operators and kernels are accelerated on Intel GPU hardware by the corresponding native vectorization and matrix calculation features. In graph mode, additional operator fusions are supported to reduce operator/kernel invocation overheads and thus increase performance.

+

This release provides the following features:

+
    +
  • Auto Mixed Precision (AMP)

    +
      +
    • support of AMP with BFloat16 and Float16 optimization of GPU operators

    • +
    +
  • +
  • Channels Last

    +
      +
    • support of channels_last (NHWC) memory format for most key GPU operators

    • +
    +
  • +
  • DPC++ Extension

    +
      +
    • mechanism to create PyTorch* operators with custom DPC++ kernels running on the XPU device

    • +
    +
  • +
  • Optimized Fusion

    +
      +
    • support of SGD/AdamW fusion for both FP32 and BF16 precision

    • +
    +
  • +
+

This release supports the following fusion patterns in PyTorch* JIT mode:

+
    +
  • Conv2D + ReLU

  • +
  • Conv2D + Sum

  • +
  • Conv2D + Sum + ReLU

  • +
  • Pad + Conv2d

  • +
  • Conv2D + SiLu

  • +
  • Permute + Contiguous

  • +
  • Conv3D + ReLU

  • +
  • Conv3D + Sum

  • +
  • Conv3D + Sum + ReLU

  • +
  • Linear + ReLU

  • +
  • Linear + Sigmoid

  • +
  • Linear + Div(scalar)

  • +
  • Linear + GeLu

  • +
  • Linear + GeLu_

  • +
  • T + Addmm

  • +
  • T + Addmm + ReLu

  • +
  • T + Addmm + Sigmoid

  • +
  • T + Addmm + Dropout

  • +
  • T + Matmul

  • +
  • T + Matmul + Add

  • +
  • T + Matmul + Add + GeLu

  • +
  • T + Matmul + Add + Dropout

  • +
  • Transpose + Matmul

  • +
  • Transpose + Matmul + Div

  • +
  • Transpose + Matmul + Div + Add

  • +
  • MatMul + Add

  • +
  • MatMul + Div

  • +
  • Dequantize + PixelShuffle

  • +
  • Dequantize + PixelShuffle + Quantize

  • +
  • Mul + Add

  • +
  • Add + ReLU

  • +
  • Conv2D + Leaky_relu

  • +
  • Conv2D + Leaky_relu_

  • +
  • Conv2D + Sigmoid

  • +
  • Conv2D + Dequantize

  • +
  • Softplus + Tanh

  • +
  • Softplus + Tanh + Mul

  • +
  • Conv2D + Dequantize + Softplus + Tanh + Mul

  • +
  • Conv2D + Dequantize + Softplus + Tanh + Mul + Quantize

  • +
  • Conv2D + Dequantize + Softplus + Tanh + Mul + Quantize + Add

  • +
+
+
+

Known Issues

+
    +
  • [CRITICAL ERROR] Kernel ‘XXX’ removed due to usage of FP64 instructions unsupported by the targeted hardware

    +

    FP64 is not natively supported by the Intel® Data Center GPU Flex Series platform. If you run an AI workload on that platform and receive this error message, it means a kernel requiring FP64 instructions has been removed and not executed; hence the accuracy of the whole workload is wrong.

    +
  • +
  • symbol undefined caused by _GLIBCXX_USE_CXX11_ABI

    +
    ImportError: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
    +
    +
    +

    DPC++ does not support _GLIBCXX_USE_CXX11_ABI=0, so Intel® Extension for PyTorch* is always compiled with _GLIBCXX_USE_CXX11_ABI=1. This undefined symbol issue appears when PyTorch* is compiled with _GLIBCXX_USE_CXX11_ABI=0. Update the PyTorch* CMake files to set _GLIBCXX_USE_CXX11_ABI=1 and compile PyTorch* with a compiler that supports _GLIBCXX_USE_CXX11_ABI=1. We recommend using GCC 9.4.0 on Ubuntu 20.04.
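
    A quick way to check the ABI of an installed PyTorch* build (a general PyTorch* utility, not specific to Intel® Extension for PyTorch*) is sketched below.

    import torch

    # True means PyTorch* was built with _GLIBCXX_USE_CXX11_ABI=1, which is what
    # Intel® Extension for PyTorch* expects; False indicates the ABI mismatch
    # described above.
    print(torch.compiled_with_cxx11_abi())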

    +
  • +
  • Can’t find the oneMKL library when building Intel® Extension for PyTorch* without oneMKL

    +
    /usr/bin/ld: cannot find -lmkl_sycl
    +/usr/bin/ld: cannot find -lmkl_intel_ilp64
    +/usr/bin/ld: cannot find -lmkl_core
    +/usr/bin/ld: cannot find -lmkl_tbb_thread
    +dpcpp: error: linker command failed with exit code 1 (use -v to see invocation)
    +
    +
    +

    When PyTorch* is built with the oneMKL library and Intel® Extension for PyTorch* is built without it, this linker issue may occur. Resolve it by setting:

    +
    export USE_ONEMKL=OFF
    +export MKL_DPCPP_ROOT=${PATH_To_Your_oneMKL}/__release_lnx/mkl
    +
    +
    +

    Then clean build Intel® Extension for PyTorch*.

    +
  • +
  • undefined symbol: mkl_lapack_dspevd. Intel MKL FATAL ERROR: cannot load libmkl_vml_avx512.so.2 or libmkl_vml_def.so.2

    +

    This issue may occur when Intel® Extension for PyTorch* is built with the oneMKL library and PyTorch* is not built with any MKL library. The oneMKL kernel may incorrectly run into the CPU backend and trigger this issue. Resolve it by installing the MKL library from conda:

    +
    conda install mkl
    +conda install mkl-include
    +
    +
    +

    then clean build PyTorch*.

    +
  • +
  • OSError: libmkl_intel_lp64.so.1: cannot open shared object file: No such file or directory

    +

    The wrong MKL library is used when multiple MKL libraries exist in the system. Preload oneMKL by setting:

    +
    export LD_PRELOAD=${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_lp64.so.1:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_intel_ilp64.so.1:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_sequential.so.1:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_core.so.1:${MKL_DPCPP_ROOT}/lib/intel64/libmkl_sycl.so.1
    +
    +
    +

    If you continue to see similar issues for other shared object files, add the corresponding files under ${MKL_DPCPP_ROOT}/lib/intel64/ to LD_PRELOAD. Note that the suffix of the libraries may change (e.g., from .1 to .2) if more than one oneMKL library is installed on the system.

    +
  • +
+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/technical_details.html b/xpu/2.3.110+xpu/tutorials/technical_details.html new file mode 100644 index 000000000..517328a29 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/technical_details.html @@ -0,0 +1,204 @@ + + + + + + + Technical Details — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Technical Details

+
+

Optimizer Optimization [GPU]

+

Optimizers are a key part of the training workloads. Intel® Extension for PyTorch* brings two types of optimizations to optimizers:

+
    +
  1. Operator fusion for the computation in the optimizers. [GPU]

  2. +
+
+
+

For more detailed information, check Optimizer Fusion on GPU.

+
+
+

Ahead of Time Compilation (AOT) [GPU]

+

AOT Compilation is a helpful feature for the development lifecycle or distribution time, when you know beforehand what your target device is going to be at application execution time. When AOT compilation is enabled, no additional compilation time is needed when running the application. It also benefits product quality, since no just-in-time (JIT) bugs are encountered: JIT is skipped, and the final code executing on the target device can be tested as-is before delivery to end users. The disadvantage of this feature is that the final distributed binary size increases considerably (e.g., from 500MB to 2.5GB for Intel® Extension for PyTorch*).

+
+
+
+
+

Memory Management [GPU]

+

Intel® Extension for PyTorch* uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without any overhead. Allocations are associated with a SYCL device. The allocator attempts to find the smallest cached block in the reserved block pool that fits the requested size. If it is unable to find an appropriate memory block among the already allocated areas, the allocator falls back to allocating a new block of memory.

+

For more detailed information, check Memory Management.

+
+
+
+
+

ipex.optimize [GPU]

+

The ipex.optimize API is designed to optimize PyTorch* modules (nn.modules) and specific optimizers within Python modules. Its optimization options for the Intel® GPU device include:

+
    +
  • Automatic Channels Last

  • +
  • Fusing Convolutional Layers with Batch Normalization

  • +
  • Fusing Linear Layers with Batch Normalization

  • +
  • Replacing Dropout with Identity

  • +
  • Splitting Master Weights

  • +
  • Fusing Optimizer Update Step

  • +
+

For more detailed information, check ipex.optimize.

+
+
+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/technical_details/AOT.html b/xpu/2.3.110+xpu/tutorials/technical_details/AOT.html new file mode 100644 index 000000000..185d91c78 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/technical_details/AOT.html @@ -0,0 +1,195 @@ + + + + + + + Ahead of Time (AOT) Compilation — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Ahead of Time (AOT) Compilation

+
+

Introduction

+

AOT Compilation is a helpful feature for the development lifecycle or distribution time, when you know beforehand what your target device is going to be at application execution time. When AOT compilation is enabled, no additional compilation time is needed when running the application. It also benefits product quality, since no just-in-time (JIT) bugs are encountered: JIT is skipped, and the final code executing on the target device can be tested as-is before delivery to end users. The disadvantage of this feature is that the final distributed binary size increases considerably (e.g., from 500MB to 2.5GB for Intel® Extension for PyTorch*).

+
+
+

Use case

+

Intel® Extension for PyTorch* provides the build option USE_AOT_DEVLIST so that users who install Intel® Extension for PyTorch* via source compilation can configure the device list for AOT compilation. The target device in the device list is specified by the DEVICE type of the target. Multi-target AOT compilation is supported by using a comma (,) as a delimiter in the device list. See the table below for the AOT settings for supported Intel GPU hardware, including Intel® Data Center GPU Flex Series, Intel® Data Center GPU Max Series, and Intel® Arc™ A-Series GPUs.

Supported HW                               AOT Setting
Intel® Data Center GPU Flex Series 170     USE_AOT_DEVLIST='ats-m150'
Intel® Data Center GPU Max Series          USE_AOT_DEVLIST='pvc'
Intel® Arc™ A-Series                       USE_AOT_DEVLIST='ats-m150'

Note: Multiple AOT settings can be used together by separating the setting texts with a comma (,), so that the compiled wheel file has multiple AOT supports. E.g., a wheel file built with USE_AOT_DEVLIST='ats-m150,pvc' has both ats-m150 and pvc AOT enabled.
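
The snippet below is an illustrative sketch of a source build with AOT enabled for multiple targets. USE_AOT_DEVLIST is the documented build option; the wheel build command itself is an assumption and may differ in your environment.

    import os
    import subprocess

    # Request AOT for 'ats-m150' (Flex Series 170 / Arc A-Series) and 'pvc'
    # (Max Series) in one wheel, as described in the note above.
    build_env = os.environ.copy()
    build_env["USE_AOT_DEVLIST"] = "ats-m150,pvc"

    # Hypothetical build invocation from the Intel® Extension for PyTorch* source tree.
    subprocess.run(["python", "setup.py", "bdist_wheel"], env=build_env, check=True)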

+

Intel® Extension for PyTorch* enables AOT compilation for Intel GPU target devices in the prebuilt wheel files. Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU Max Series are the enabled target devices in the current release, with Intel® Arc™ A-Series GPUs having prototype support. If Intel® Extension for PyTorch* is executed on a device that is not pre-configured in USE_AOT_DEVLIST, the application can still run, because JIT compilation is triggered automatically to allow execution on the current device. This causes additional compilation time during execution.

+

For more GPU platforms, please refer to Use AOT for Integrated Graphics (Intel GPU).

+
+
+

Requirement

+

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver must be installed beforehand to use AOT compilation. Once USE_AOT_DEVLIST is configured, Intel® Extension for PyTorch* will provide -fsycl-targets=spir64_gen option and -Xs "-device ${USE_AOT_DEVLIST}" option for generating binaries that utilize Intel® oneAPI Level Zero backend.

+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/technical_details/ipex_optimize.html b/xpu/2.3.110+xpu/tutorials/technical_details/ipex_optimize.html new file mode 100644 index 000000000..e8114cecc --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/technical_details/ipex_optimize.html @@ -0,0 +1,207 @@ + + + + + + + ipex.optimize Frontend API — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

ipex.optimize Frontend API

+

The ipex.optimize API is designed to optimize PyTorch* modules (nn.modules) and specific optimizers within Python modules. Its optimization options for Intel® GPU device include:

+
    +
  • Automatic Channels Last

  • +
  • Fusing Convolutional Layers with Batch Normalization

  • +
  • Fusing Linear Layers with Batch Normalization

  • +
  • Replacing Dropout with Identity

  • +
  • Splitting Master Weights

  • +
  • Fusing Optimizer Update Step

  • +
+

The original Python modules will be replaced with optimized versions automatically during model execution if ipex.optimize is called in the model running script.
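
A minimal usage sketch for inference is shown below; it assumes an XPU device is available and uses a toy model purely for illustration.

    import torch
    import intel_extension_for_pytorch as ipex

    # Toy model with a Conv2d/BatchNorm2d pair for the GPU optimizations to act on.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3),
        torch.nn.BatchNorm2d(8),
        torch.nn.ReLU(),
    ).eval().to("xpu")

    # Apply the default GPU optimizations (automatic channels last, conv_bn_folding, etc.).
    model = ipex.optimize(model)

    with torch.no_grad():
        out = model(torch.randn(1, 3, 224, 224, device="xpu"))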

+

The following sections provide detailed descriptions for each optimization flag supported by XPU models on Intel® GPU. For CPU-specific flags, please refer to the API Docs page.

+
+

Automatic Channels Last

+

By default, ipex.optimize checks whether the current GPU platform supports 2D Block Array Load. If it does, the Conv*d and ConvTranspose*d modules inside the model will be optimized to use the channels last memory format. Use ipex.enable_auto_channels_last or ipex.disable_auto_channels_last before calling ipex.optimize to enable or disable this feature manually.
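
For example (a sketch assuming an XPU device), the automatic conversion can be forced off regardless of the platform check:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Conv2d(3, 8, kernel_size=3).eval().to("xpu")

    # Skip the platform check and keep the default contiguous memory format;
    # ipex.enable_auto_channels_last() would force the conversion on instead.
    ipex.disable_auto_channels_last()
    model = ipex.optimize(model)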

+
+
+

conv_bn_folding

+

This flag is applicable for model inference. Intel® Extension for PyTorch* tries to match all connected nn.Conv(1/2/3)d and nn.BatchNorm(1/2/3)d layers with matching dimensions in the model and fuses them to improve performance. If the fusion fails, the optimization is abandoned and the model is automatically executed on the normal path.

+
+
+

linear_bn_folding

+

This flag is applicable for model inference. Intel® Extension for PyTorch* tries to match all connected nn.Linear and nn.BatchNorm(1/2/3)d layers in the model and fuses them to improve performance. If the fusion fails, the optimization is abandoned and the model is automatically executed on the normal path.
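
Both folding passes can be controlled explicitly through the corresponding ipex.optimize flags. The sketch below uses a toy model and disables them, e.g. to compare numerical results while debugging; it assumes an XPU device.

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Sequential(
        torch.nn.Linear(16, 16),
        torch.nn.BatchNorm1d(16),
    ).eval().to("xpu")

    # Keep the Conv/BatchNorm and Linear/BatchNorm pairs unfused.
    model = ipex.optimize(model, conv_bn_folding=False, linear_bn_folding=False)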

+
+
+

replace_dropout_with_identity

+

This flag is applicable for model inference. All instances of torch.nn.Dropout will be replaced with torch.nn.Identity. The Identity modules will be ignored during the static graph generation. This optimization could potentially create additional fusion opportunities for the generated graph.

+
+
+

split_master_weight_for_bf16

+

This flag is applicable for model training. The optimization will be enabled once the following requirements are met:

+
    +
  • When calling ipex.optimize, the dtype flag must be set to torch.bfloat16.

  • +
  • fuse_update_step must be enabled.

  • +
+

The optimization process is as follows:

+
    +
  • Wrap all parameters of this model with ParameterWrapper.

  • +
  • Convert the parameters that meet the condition specified by ipex.nn.utils._parameter_wrapper.can_cast_training. This includes the original dtype torch.float, and module types defined in ipex.nn.utils._parameter_wrapper.IPEX_WEIGHT_CONVERT_MODULE_XPU.

  • +
  • Convert the parameters wrapped by ParameterWrapper to the user-specified dtype. If split master weight is needed, the optimizer can only be SGD. The original parameters will be divided into top and bottom parts. The top part will be used for forward and backward computation. When updating weights, both the top and bottom parts will be updated simultaneously.

  • +
+
+
+

fuse_update_step

+

This flag is used to specify whether to replace the original optimizer step with a fused step for better performance. The supported optimizers can be referenced from IPEX_FUSED_OPTIMIZER_LIST_XPU in ipex.optim._optimizer_utils. During the optimization, the original step is saved as optimizer._original_step, optimizer.step is replaced with a SYCL-written kernel, and the optimizer.fused parameter is set to True.
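
A training-oriented sketch combining these flags is shown below. It assumes an XPU device, uses SGD (required when splitting master weights), and only illustrates how the options are passed to ipex.optimize.

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).train().to("xpu")
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model, optimizer = ipex.optimize(
        model,
        optimizer=optimizer,
        dtype=torch.bfloat16,              # required for split master weights
        split_master_weight_for_bf16=True,
        fuse_update_step=True,             # install the fused optimizer step
    )

    # After optimization the fused step is installed and flagged on the optimizer.
    print(optimizer.fused)  # expected to be True for supported optimizers such as SGD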

+
+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/technical_details/memory_management.html b/xpu/2.3.110+xpu/tutorials/technical_details/memory_management.html new file mode 100644 index 000000000..15b70b3ef --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/technical_details/memory_management.html @@ -0,0 +1,167 @@ + + + + + + + Memory Management — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Memory Management

+

You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that it can be used by other GPU applications. However, the GPU memory occupied by tensors will not be freed, so this does not increase the amount of GPU memory available for PyTorch.

+

For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.
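
The sketch below exercises these APIs, assuming they are exposed under the torch.xpu namespace (mirroring their CUDA counterparts) and that Intel® Extension for PyTorch* is installed with an XPU device available.

    import torch
    import intel_extension_for_pytorch as ipex  # registers the xpu device

    x = torch.randn(1024, 1024, device="xpu")

    print(torch.xpu.memory_allocated())      # bytes currently occupied by tensors
    print(torch.xpu.max_memory_allocated())  # peak tensor usage so far
    print(torch.xpu.memory_reserved())       # total memory held by the caching allocator
    print(torch.xpu.max_memory_reserved())

    del x
    torch.xpu.empty_cache()                  # return unused cached blocks to the device

    stats = torch.xpu.memory_stats()         # detailed allocator counters
    snapshot = torch.xpu.memory_snapshot()   # full allocator state for offline analysis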

+
+ + +
+
+
+ +
+ +
+

+ + +
+
+
+
+
+ + + + \ No newline at end of file diff --git a/xpu/2.3.110+xpu/tutorials/technical_details/optimizer_fusion_gpu.html b/xpu/2.3.110+xpu/tutorials/technical_details/optimizer_fusion_gpu.html new file mode 100644 index 000000000..337174799 --- /dev/null +++ b/xpu/2.3.110+xpu/tutorials/technical_details/optimizer_fusion_gpu.html @@ -0,0 +1,192 @@ + + + + + + + Optimizer Fusion on GPU — Intel&#174 Extension for PyTorch* 2.3.110+xpu documentation + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Optimizer Fusion on GPU

+
+

Introduction

+

As with TorchScript, operation fusion reduces the number of operators that will be executed and reduces the overhead time. This methodology is also applied in Intel® Extension for PyTorch* optimizer optimization. We support SGD/Adam/AdamW/Lamb/Lars fusion for both FP32 and BF16 at the current stage.

+

Let’s examine the code of the SGD update as an example.

+

+    # original version of the SGD parameter update
+    if weight_decay != 0:
+        grad = grad.add(param, alpha=weight_decay)
+    if momentum != 0:
+        buf = momentum_buffer_list[i]
+        if buf is None:
+            buf = torch.clone(grad).detach()
+            momentum_buffer_list[i] = buf
+        else:
+            buf.mul_(momentum).add_(grad, alpha=1 - dampening)
+        if nesterov:
+            grad = grad.add(buf, alpha=momentum)
+        else:
+            grad = buf
+
+    param.add_(grad, alpha=-lr)
+
+
+
+
+

Operation Fusion

+

One problem with the native implementation above is that we need to access the storage of grad, param, and buf several times. For large topologies, grad and the parameters might not be stored in cache. When we need to access the storage of grad again while executing the remaining clauses, the processor must read the data out of slow memory again instead of the more efficient high-speed cache. This is a memory-bound bottleneck that prevents good performance.

+

Operation fusion is a way to solve this problem. The clauses in the pseudo code are all element-wise operations, so we can fuse them into a single operation, as in the pseudo code below.

+
   # fused version
+   sgd_fused_step(param, grad, buf, ...(other args))
+
+
+

After fusion, the single operation sgd_fused_step provides equivalent functionality but much better performance compared with the original version of the SGD update.
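
In practice the fused step is requested through ipex.optimize. The sketch below assumes an XPU device and shows one FP32 training iteration where optimizer.step runs the fused SGD update.

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(128, 128).train().to("xpu")
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # fuse_update_step=True swaps optimizer.step for the fused kernel
    # (conceptually the sgd_fused_step call shown above).
    model, optimizer = ipex.optimize(model, optimizer=optimizer, fuse_update_step=True)

    loss = model(torch.randn(4, 128, device="xpu")).sum()
    loss.backward()
    optimizer.step()        # runs the fused update
    optimizer.zero_grad()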

+
+
+ + +
+
+
+ +
+ +
+

© Copyright .

+
+ + Built with Sphinx using a + theme + provided by Read the Docs. + +

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (OBSD), http://opensource.org/licenses/0BSD.
+ + +
+
+
+
+
+ + + + \ No newline at end of file