This SYCL implementation of All-Reduce shows how to achieve peak performance on an Intel PVC system with XeLinks. The key to reaching the designed peak is a single-shot kernel that uses all available internal bandwidth (cross-tile and XeLinks) simultaneously. The implementation is currently demonstrated in half precision only.
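For checking results, it can help to have a plain host-side reference of what All-Reduce is expected to produce. The sketch below is not part of this repository: `reference_allreduce_sum` is a hypothetical helper, it assumes the reduction operator is an element-wise sum, and it uses `float` instead of half precision so that it compiles without a SYCL toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the repository's kernel.
// Assumption: the reduction operator is an element-wise sum.
// ranks[r] holds the input buffer of rank r. On return, every
// buffer holds the element-wise sum over all ranks.
inline void reference_allreduce_sum(std::vector<std::vector<float>>& ranks) {
    if (ranks.empty()) return;
    const std::size_t n = ranks[0].size();

    // Accumulate the contributions of all ranks into one buffer.
    std::vector<float> total(n, 0.0f);
    for (const auto& buf : ranks) {
        assert(buf.size() == n && "all ranks must contribute the same element count");
        for (std::size_t i = 0; i < n; ++i) total[i] += buf[i];
    }

    // Every rank receives the same reduced result.
    for (auto& buf : ranks) buf = total;
}
```

With 8 simulated ranks where rank `r` contributes the value `r + i` at index `i`, each rank should end up with `28 + 8 * i` at index `i`. Comparing against device output in half precision would need a tolerance, since rounding differs from `float`.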
- Intel SYCL Compiler
- Latest drivers for PVC
- MPI
git submodule update --init
make main
mpirun -np 8 ./main -n \<number of half-precision elements\> [-w sub-group] [-g group] [-a small | simple | bisect]