Skip to content
Eric Kern edited this page Feb 27, 2022 · 16 revisions

Transpose 2D

Contents

In this project, we implement parallel versions of a 2d-matrix-transpose algorithm with OpenMP and TBB.

A 2d matrix transposition changes the rows and columns of an arbitrary matrix. The main computation in a transposition consists of the calculation of the output element position.

We want to test our implementations in two scenarios:

  • a memory-bound scenarios in which we want to hit peak memory performance
  • a computational-bound artificial scenarios in which we generate input values instead of loading them from memory

First of all, we want to give a short description of the used frameworks OpenMP and TBB. Then we analyze the transpose algorithm and follow up with specific implementations. After that, we present benchmark results in the two test scenarios on two different hardware setups.

Frameworks

OpenMP API

One of our used frameworks for parallelization is OpenMP. It is a compiler-based approach which relies on preprocessor directives to identify parallel code sections. OpenMP is based on a fork-join model to create threads and uses work sharing to distribute parallel work. Nested parallelism is also supported, but too many threads can cause oversubscription of the available resources.

TBB API

TBB is a software library initially developed by Intel that provides parallel constructs without relying on special languages or compilers. This also means that code written with TBB can be easily ported to many processors and operating systems. Nested parallelism is supported as well but in contrast to OpenMP, TBB uses work-stealing as a load balancing concept. The problem of oversubscription is highly reduced with TBBs' task concept. Instead of actual threads, TBB allows the programmer to specify logical parallelism that TBB maps internally to a certain number of threads.

Problem analysis

In a matrix transposition, the actual computations a computing device has to perform are only a few arithmetic integer calculations to determine the position of an output element. Hence the computational intensity of our problem is extremely low. In this regard, the problem is memory-bound which means memory optimizations are the most promising candidate for performance optimizations also in a parallel approach.

In the present case, we always use an input and an output matrix and don't consider an in-place transposition.

Our matrices store the values in a row-major pattern which means each row is consecutively stored after each other in a one-dimensional array. An example can be seen in the following picture.

drawing

This means in a matrix transposition accessing consecutive elements in a row of the input matrix results in a memory access with a stride in the output matrix. Thus we can only read or write in one of the matrices in a contiguous access pattern. In the other one, the stridden access will lead to an expensive memory access if already fetched cache lines get evicted before the other elements of that cache line can be processed (capacity or conflict misses).
Therefore the biggest performance improvements can be expected from optimizing the access pattern to leverage data locality for better cache utilization.

The default access pattern can be seen in the following picture which shows the elements loaded with cache lines represented by the green boxes. When advancing to the next row of matrix a the elements in the already fetched cache lines from matrix b can be used. This behavior can be observed only for small matrix dimensions. If too many cache lines have to be loaded for one row the first ones get evicted and no reuse of the cached data is possible.

drawing

A widely used approach to increase cache locality is the concept of tiling (also called loop blocking). During tiling, the matrix is subdivided into smaller blocks (so-called tiles) which are individually processed one after each other. Inside each block, each row is still separated in memory and switching among rows still results in a stridden access.
However, with tiling, the neighbouring elements in a row can be reused from the same cache line because it stays in the cache if the dimensions of a tile are chosen appropriately. The tile size must be chosen according to the actual cache size. Only then no cache line gets evicted during a single block execution. This optimized access pattern using tiles is shown in the following picture.

drawing

In a cache-aware implementation, the tile size has to be determined by running multiple tests on a specific hardware setup depending on the actual cache size. This limits the portability of the algorithm.

However in our case, also a generic cache-oblivious implementation exists which determines the tile size automatically. In the present case, our matrix is recursively subdivided into smaller tiles until they fit in the cache. A more detailed explanation of our cache-oblivious implementation can be found in the TBB section.

Formal Criteria

To meet the requirements of the exercise and have a similar interface to the STL algorithms, we use iterators as our function arguments. Also, as we implement general patterns independent of the actual data we transpose, we stick to a template functions so they work generically.

Serial Implementation

To have a baseline that is based on an STL-Algorithm we use the for_each() function for the transposition.

template <class InIter, class OutIter>
void stl_like(InIter inStart, InIter inEnd, OutIter outStart, const size_t in_rows){
    const size_t in_columns = std::distance(inStart, inEnd) / in_rows;
    size_t i = 0;
    using InElemType = typename std::iterator_traits<InIter>::value_type;
    std::for_each(inStart, inEnd, [&](InElemType &inElem){
        *outStart = inElem;     
        i++;
        if(i % in_columns == 0)
        {
            std::advance(outStart, -(in_columns-1)*in_rows);  // jump back into first row
            ++outStart;                                       // incrase column by 1
        }
        else
        {
            std::advance(outStart, in_rows);        // step down a row in the Output Matrix
        }
    });
}

In this implementation, we go through every row in the input matrix and write the values column-wise to the output matrix. This pattern results in a contiguous read and a stride write access to memory.

We see that the calculation of the iterator position is quite complicated for a rather simple algorithm like a matrix transposition. With the for_each() function we iterate over a one-dimensional index space even though we operate on two dimensions. The other significant problem of for_each() is that it dereferences the value of the iterator so only the element itself is visible inside the lambda function but not its index. But it is exactly the index that is necessary to derive the output position of the element. This is also the reason why we need the count variable i in the code so we can determine when we reach the last row and when to jump back to the first row. The use of this incrementing variable also limits the use of a parallel execution policy with the for_each() function as the value of i is only valid in a serial execution. Therefore parallel execution is not possible with this implementation. Additionally, the complicated iterator shifts make prefetching more difficult and may reduce performance.

As a remedy for this flaw, we used a self-written iterator that produces linearly increasing integer values with a step size of 1 to receive indices as a function argument in the lambda function. With these, we can calculate the row and column of the current element and have all the necessary information to perform the transposition. Additionally, the loop iterations are independent and the algorithm can be parallelized. In our measurements, we will only look at the parallelizable for_each() implementation. All implementations can be found in the repository file inc/transpose.hpp.

Compared to the rather complex iterator shift in the for_each() version a serial version with two nested for loops which iterate over both dimensions of the matrix seems much more suitable. Therefore we use the following code as a serial baseline in the plots.

// Serial implementation contiguous write
template <class InIter, class OutIter>
void transposeSerial(InIter inStart, OutIter outStart, OutIter outEnd, const size_t in_rows){
    const size_t n_elem = std::distance(outStart, outEnd); 
    const size_t in_columns = n_elem / in_rows;
    for(size_t x = 0; x < in_columns; ++x){
        for(size_t y = 0; y < in_rows; ++y){
            *(outStart + (x * in_rows + y)) = *(inStart + (y * in_columns + x));
        }
    }
}

TBB

For our next implementation we use Thread Building Blocks for a parallel solution. Our basic TBB implementation can be seen in the following code snipped.

template <class InIter, class OutIter, class Partitioner>
void tbb(InIter inStart, OutIter outStart, OutIter outEnd, const size_t in_rows, size_t gs, Partitioner& part){
    const size_t n_elem = std::distance(outStart, outEnd); 
    const size_t in_columns = n_elem / in_rows;
    
    oneapi::tbb::parallel_for(
        oneapi::tbb::blocked_range2d<size_t>(0, in_rows, gs, 0, in_columns, gs),
        [&](oneapi::tbb::blocked_range2d<size_t> r){
        // transpose subblock of range2d
        for(size_t x = r.cols().begin(); x < r.cols().end(); ++x){
            for(size_t y = r.rows().begin(); y < r.rows().end(); ++y){
                *(outStart + (x * in_rows + y)) = *(inStart + (y * in_columns + x));
            }
        }
    },part
    );
}

The parallel_for() function receives a blocked_range2d object according to the actual matrix dimensions. This object then gets internally subdivided depending on the chosen scheduling policy. In our case, we use the simple_partitioner for our cache-oblivious solution to divide the matrix in blocks of the specified grainsize. However, the grainsize is only an upper bound (in combination with the simple partitioner) thus the actual size of the subblocks varies in order to fully divide arbitrary matrices. Then we iterate over the rows and columns of these tiles and store the input-values in the corresponding output position.

To achieve the cache-oblivious behavior we use the simple_partitioner with the blocked_range2d object. The range object describes the indices of our input matrix and gets recursively split by the simple_partitioner along the bigger dimension until the defined grainsize is reached (see picture below). Although we previously mentioned that the cache-oblivious algorithm automatically finds an optimal tile size we have to specify a termination condition that definitely fits into the first-level cache.

drawing

The internal work-stealing behavior of the TBB scheduler leads to a tile execution similar to the pattern shown in the following picture. This traversal pattern leads generically to a good data access pattern because even if the tile size dimension is smaller than a cache line the data is reused in one of the following tiles.

drawing

For further details see Tuning TBB Algorithms: Granularity, Lacality, Prallelism and Determinism in Intel's book "Pro TBB" from 2019.

TBB SIMD

An optimization to the previously described cache-oblivious version can be achieved by using SIMD instructions for the transposition of the submatrices in every tile. In a first approach, we used OpenMPs #pragma omp simd directive. However, this didn't change anything in the execution time. After several variations of the additional clauses like simdlen() we continued by using true vector intrinsics. As both of our machines support 256bit wide vector registers we decided to use the AVX-Intrinsics which allows us to transpose 8x8 blocks in a single subroutine. With this implementation, we create additional tiling in each submatrix which results in even smaller blocks that fit in higher cache levels. Additionally, we highly reduce instruction overhead as we transpose an 8x8 block in one loop iteration. Hence we have an effect similar to loop unrolling.

In this implementation, we statically divide the submatrices in 8x8 blocks what might lead to a border region that cannot be transposed as a full 8x8 block. Therefore we have to manually check if these regions exist and if so transpose these elements one by one like in the version without the vector intrinsics.

In the following, we have a code snippet of the function that transposes a 8x8 block with AVX vector intrinsics taken from https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2.

void tran(float* mat, float* matT, size_t in_w, size_t out_w) {
  __m256  r0, r1, r2, r3, r4, r5, r6, r7;
  __m256  t0, t1, t2, t3, t4, t5, t6, t7;

  r0 = _mm256_load_ps(&mat[0*in_w]);                     //  1  2  3  4  5  6  7  8 
  r1 = _mm256_load_ps(&mat[1*in_w]);                     //  9 10 11 12 13 14 15 16 
  r2 = _mm256_load_ps(&mat[2*in_w]);                     // 17 18 19 20 21 22 23 24 
  r3 = _mm256_load_ps(&mat[3*in_w]);                     // 25 26 27 28 29 30 31 32 
  r4 = _mm256_load_ps(&mat[4*in_w]);                     // 33 34 35 36 37 38 39 40 
  r5 = _mm256_load_ps(&mat[5*in_w]);                     // 41 42 43 44 45 46 47 48 
  r6 = _mm256_load_ps(&mat[6*in_w]);                     // 49 50 51 52 53 54 55 56 
  r7 = _mm256_load_ps(&mat[7*in_w]);                     // 57 58 59 60 61 62 63 64 


  t0 = _mm256_unpacklo_ps(r0, r1);                       //  1  9  2 10  5 13  6 14 
  t1 = _mm256_unpackhi_ps(r0, r1);                       //  3 11  4 12  7 15  8 16 
  t2 = _mm256_unpacklo_ps(r2, r3);                       // 17 25 18 26 21 29 22 30 
  t3 = _mm256_unpackhi_ps(r2, r3);                       // 19 27 20 28 23 31 24 32 
  t4 = _mm256_unpacklo_ps(r4, r5);                       // 33 41 34 42 37 45 38 46 
  t5 = _mm256_unpackhi_ps(r4, r5);                       // 35 43 36 44 39 47 40 48 
  t6 = _mm256_unpacklo_ps(r6, r7);                       // 49 57 50 58 53 61 54 62 
  t7 = _mm256_unpackhi_ps(r6, r7);                       // 51 59 52 60 55 63 56 64 


  r0 = _mm256_shuffle_ps(t0,t2,_MM_SHUFFLE(1,0,1,0));    //  1  9 17 25  5 13 21 29 
  r1 = _mm256_shuffle_ps(t0,t2,_MM_SHUFFLE(3,2,3,2));    //  2 10 18 26  6 14 22 30
  r2 = _mm256_shuffle_ps(t1,t3,_MM_SHUFFLE(1,0,1,0));    //  3 11 19 27  7 15 23 31 
  r3 = _mm256_shuffle_ps(t1,t3,_MM_SHUFFLE(3,2,3,2));    //  4 12 20 28  8 16 24 32 
  r4 = _mm256_shuffle_ps(t4,t6,_MM_SHUFFLE(1,0,1,0));    // 33 41 49 57 37 45 53 61 
  r5 = _mm256_shuffle_ps(t4,t6,_MM_SHUFFLE(3,2,3,2));    // 34 42 50 58 38 46 54 62 
  r6 = _mm256_shuffle_ps(t5,t7,_MM_SHUFFLE(1,0,1,0));    // 35 43 51 59 39 47 55 63 
  r7 = _mm256_shuffle_ps(t5,t7,_MM_SHUFFLE(3,2,3,2));    // 36 44 52 60 40 48 56 64 


  t0 = _mm256_permute2f128_ps(r0, r4, 0x20);             //  1  9 17 25 33 41 49 57
  t1 = _mm256_permute2f128_ps(r1, r5, 0x20);             //  2 10 18 26 34 42 50 58 
  t2 = _mm256_permute2f128_ps(r2, r6, 0x20);             //  3 11 19 27 35 43 51 59
  t3 = _mm256_permute2f128_ps(r3, r7, 0x20);             //  4 12 20 28 36 44 52 60 
  t4 = _mm256_permute2f128_ps(r0, r4, 0x31);             //  5 13 21 29 37 45 53 61 
  t5 = _mm256_permute2f128_ps(r1, r5, 0x31);             //  6 14 22 30 38 46 54 62 
  t6 = _mm256_permute2f128_ps(r2, r6, 0x31);             //  7 15 23 31 39 47 55 63 
  t7 = _mm256_permute2f128_ps(r3, r7, 0x31);             //  8 16 24 32 40 48 56 64 

  _mm256_store_ps(&matT[0*out_w], t0);
  _mm256_store_ps(&matT[1*out_w], t1);
  _mm256_store_ps(&matT[2*out_w], t2);
  _mm256_store_ps(&matT[3*out_w], t3);
  _mm256_store_ps(&matT[4*out_w], t4);
  _mm256_store_ps(&matT[5*out_w], t5);
  _mm256_store_ps(&matT[6*out_w], t6);
  _mm256_store_ps(&matT[7*out_w], t7);
}

In the first instruction block, we load the 8x8 block in the vector registers with a 256 bit load instruction for every single row. Then we have to unpack these values and split them into new registers for the shuffle instruction. The shuffle instruction takes two registers as input and picks elements symmetrically from the two input vectors. In that sense, it's a specialization of a gather instruction. The control word consists of two gather positions from the first input vector and two for the second input vector. The 4-value gather operation is performed on the upper half of the input and the output registers and also on the lower half of the input and output registers.

After that, we have to rearrange the values of the upper and lower half registers with the _mm256_permute2f128_ps instructions to achieve the last step of the transposition. The instruction again takes two input registers and a control word that specifies the gather positions. The instruction selects a 128 bit chunk respectively 4 float values from the given position in the control word and stores them in the output register.

NUMA Aware Data Placement

Even though the cache-oblivious implementation can be used on different machines with different cache sizes it doesn't allow a NUMA aware data placement. As we rely on the simple_partitioner to divide the matrix into subblocks of the specified grainsize we cannot determine which thread accesses which data because of the work-stealing scheduling policy of TBB.

The static_partitioner would allow a deterministic data access for every thread and hence would allow a NUMA aware data placement but then our subblocks won't fit in cache anymore as the chunk size for each thread is as big as possible. Also, the traversal pattern is lost with this scheduling policy.

The use of the affinity_partitioner is also impractical as we lose knowledge of the chunk size at all. Here, the specified grainsize is only a lower bound and thus we cannot guarantee that each submatrix actually fits in cache.

OpenMP

As a first approach with OpenMP, we start by parallelizing the naive serial implementation. We iterate over the rows and columns of the matrices in two nested for loops and parallelize the outer one with the #pragma omp parallel for directive. To create a contiguous write access in the output matrix we step through the input matrix column by column in the inner loop. We only parallelize the outer loop because otherwise, the individual work for one thread is too small compared to the forking overhead.

template <class InIter, class OutIter>
void openMP(InIter inStart, OutIter outStart, OutIter outEnd, const size_t in_rows){
    const size_t n_elem = std::distance(outStart, outEnd); 
    const size_t in_columns = n_elem / in_rows;

    #pragma omp parallel for schedule(TRANSPOSE_POLICY)
    for(size_t x = 0; x < in_columns; ++x){
        for(size_t y = 0; y < in_rows; ++y){
            *(outStart + (x * in_rows + y)) = *(inStart + (y * in_columns + x));
        }
    }
}

As described in problem analysis we want to leverage data locality to increase performance. Hence we want to introduce tiling in our OpenMP version as well. However, as OpenMP doesn't provide a mechanism to automatically subdivide the problem we create tiles manually and therefore implement a cache-aware solution.

OpenMP Tiled

To introduce tiling in our OpenMP version, we add two additional loops in our code. In the outer two loops we iterate over the blocks in both dimensions and in the inner two loops we iterate over the rows and columns of the submatrices like in the naive version. In this case, we have to determine the block size experimentally for every machine as we have to trim this parameter for the individual cache sizes.

The following code snippet shows the algorithm for the regular blocks.

template <class InIter, class OutIter>
void openMPTiled(InIter inStart, OutIter outStart, OutIter outEnd, const size_t in_rows, size_t blocksize){
    const size_t n_elem = std::distance(outStart, outEnd); 
    const size_t in_columns = n_elem / in_rows;
    
    // calculate regular blocks
    #pragma omp parallel for schedule(TRANSPOSE_POLICY)
    for (size_t xx = blocksize; xx <= in_columns; xx += blocksize) {
        for (size_t yy = blocksize; yy <= in_rows; yy += blocksize) {
            // transpose the block beginning at [xx,yy]
            for (size_t x = xx - blocksize; x < xx; ++x) {
                for (size_t y = yy - blocksize; y < yy; ++y) {
                    *(outStart + (x * in_rows + y)) = *(inStart + (y * in_columns + x));
                }
            }
        }
    }

    // calculate irregular blocks
    // ...
}

Because of the static block size, we have the same problem as in the TBB SIMD version. Here we also have to check if the matrix is completely transposable with the given block size. Otherwise we have to take care of the border regions and transpose the remaining elements individually.

Another thing that comes to mind is parallelizing additional loops in this implementation as we have doubled the number of for-loops. It is possible to collapse the two outermost for-loops and schedule blocks to each thread. But this only reduced the achieved performance so we contiue without loop collapsing. Further parallelization is not feasible as we would break up our tiles and thus would lose the desired locality.

OpenMP Tiled SIMD

Also in our OpenMP version, we consider vectorization to be beneficial and therefore want to use SIMD instructions. However, just like in the TBB version, the use of the #pragma omp simd directive doesn't change much in performance. Hence, we use the AVX Intrinsics again in the same manner as in the TBB version.

The following code snippet shows the transposition algorithm for the regular blocks.

template <class InIter, class OutIter>
void openMPIntrin(InIter inStart, OutIter outStart, OutIter outEnd, const size_t in_rows, size_t blocksize){
    //blocksize can only be a multiple of 8
    const size_t n_elem = std::distance(outStart, outEnd); 
    const size_t in_columns = n_elem / in_rows;
    
    #pragma omp parallel for schedule(TRANSPOSE_POLICY)
    for (size_t xx = blocksize; xx <= in_columns; xx += blocksize) {
        for (size_t yy = blocksize; yy <= in_rows; yy += blocksize) {
            for (size_t x = xx-blocksize; x < xx; x+=8){
                for (size_t y = yy-blocksize; y < yy; y+=8){
                    tran(&(*(inStart + (y * in_columns + x))),
                         &(*(outStart + (x * in_rows + y))),
                         in_columns,
                         in_rows);
                }
            }           
        }
    }

    // calculate irregular blocks
    // ...
}

Due to the static size of the SIMD blocks it can happen that the tile cannot be completely transposed if the tile width is not a full multiple of 8. For simplification, we just allow block sizes that are multiples of 8.

NUMA Aware Data Placement

By using the static scheduling policy of OpenMP, we have a deterministic assignment of threads to data elements and can therefore implement a NUMA aware data placement. With regard to the first-touch policy, we initialize the data in the same order in which we execute the transposition. This means the loop structure, as well as individual start and end values during the initialization, correspond to the loops in the examples.

Furthermore, we use a custom allocator to put the initial data into the NUMA-regions of the actual thread that writes the data for the first time.

Results

In the following section, we want to visualize and interpret our benchmark results. We distinguish between memory and computational performance on our two test systems.

Memory Benchmarks

For the out of place transposition, we have to load the input matrix and store the output matrix. This results in 2 * matrix_width * matrix_heigth * sizeof(float) moved bytes. As we just reorder the input elements and store the same number of elements in the output matrix our algorithm behaves close to a copy instruction. Therefore we use the copy-benchmark of the likwid benchmarking application to measure the maximum achievable memory bandwidth for each of our two test systems. The results of the benchmark can be seen in the following table.

likwid benchmark Media Rome
copy 18.2 GB/s 98.9 GB/s

These results correspond to our own measured copy performance. Since the memory benchmark uses a perfectly aligned memory access pattern we expect a lower memory performance in the transposition algorithm. However, the copy benchmark serves as a roofline for our implementation.

Media

Test Setting:
For the following memory benchmarks we always use square matrices starting from a matrix width of 25 up to a width of 215 with single-precision floating-point values as elements. The Media node has a UMA architecture and features an Intel Xeon CPU E3-1585 v5 with 4 physical and 8 logical cores. It has 32KB L1 cache, 256KB L2 cache and a shared L3 cache with 8MB.
We also include the size of the cache levels in the following plots by drawing vertical lines at total_cache_size / sizeof(float). It must be noted that it's unlikely due to cache conflicts that the whole cache size can always be used.

STL Version

As our very first approach, we wanted to see how our parallel STL versions perform compared to a serial baseline. This can be seen in the following plot. In the performance plot, we measured the achieved bandwidth over an increasing problem size.
Because transposition with the parallel for_each() was not straightforward and needed adjustments to make it work in parallel we didn't expect performance gains compared to the serial version. It can be seen that the parallel versions are even with big problem sizes much slower than the serial implementation.

drawing

Basic Versions

In the following we present our results of the simple OpenMP and TBB implementation. To have a comparison we also include the achieved performance of the serial version. We first look at the results of the Media node as we want to describe only algorithmic optimizations without the influence of NUMA effects. Those will be introduced later on the Rome node.

For OpenMP we used dynamic scheduling and for TBB the simple partitioner with a grainsize of 64. Static scheduling lead to worse results which can be seen in Appendix A.

drawing

As expected, we can see that the serial implementation clearly performs worse than all other parallel solutions. The vertical lines for L1 and L2 are calculated considering the caches of every core so only the L3 line is applicable for the serial version. The performance of the serial implementation drops after the problem no longer fits into L3. We assume that L1 and L2 have little impact on the performance because of little to no spatial locality in our naive implementation. The OpenMP version clearly sees a performance improvement over the serial version. Especially when the problem fits into cache. However, the moment the problem size exceeds the last level cache the performance significantly drops and it performs only slightly better than the serial implementation. In this version, we increased the used cores but we still don't create locality in our code.

In contrast, the TBB version performs much better than the OpenMP version also when exceeding the cache size. As described in the TBB section the TBB version uses a cache-oblivious approach to use locality independent of the actual cache size of the architecture. Therefore the drop after L3 is not that big compared to the other versions.

In all our implementations, we read the elements of the input matrix with a stride to have a contiguous write access for the output matrix. Here we included a version of TBB that switches the memory access pattern around so the contiguous memory access is in the input matrix. But it can be seen that this version performs worse than with a contiguous write access. Therefore we stick to a contiguous write pattern in all other implementations.

Tiled Versions

As described in the OpenMP Tiled section we want to reuse elements that are already fetched with a cache line to maximize bandwidth performance. In the following plot, we can see the results of this version.

We use a tile width of 64 elements similar to the defined grainsize of the blocked_range2d used in the TBB version. We still use the dynamic scheduling of OpenMP and have included the TBB version from the last plot for comparison.

drawing

We can see that the OpenMP tiled version achieves similar performance to the TBB version. So the introduced changes have the desired effect. In both algorithms, we are already close to peak performance. With 14.2 GB/s for OpenMP and TBB, we reach around 75% of the peak performance.

In addition to the TBB version and the OpenMP tiled version we also included a SIMD version for each of the algorithms. In these, we used the #pragma omp simd directive to vectorize the inner loop. Unfortunately, the compiler vectorization doesn't lead to a performance increase as the performance almost stays the same.

Intrinsic Versions

As the use of OpenMPs' SIMD directive yields no additional performance we implemented versions for OpenMP and TBB as described in TBB SIMD by using actual AVX-intrinsics.

By using the Intrinsics we found a tile width respectively a grainsize of 128 elements to be optimal. An analysis regarding these parameters can be found in the following section. As a reminder, we still use square matrices with the dynamic scheduling policy in OpenMP and the simple_partitioner in TBB.

The following plot shows the memory performance of the intrinsics implementations. For comparison, we included the tiled versions without the AVX-intrinsics.

drawing

We can see a significant performance increase as long as the problem fits into L3 but there is also a small increase in the sustained bandwidth after the problem no longer fits into cache compared to the versions without vectorization.

With our last optimization, we achieved a sustained bandwidth of 15.8 GB/s which translates to 87% peak performance.

Tile Size Analysis

The presented solutions in the previous sections use different tile widths for optimal performance. We determined these experimentally, the results can be seen in the following plot.

drawing

In this scenario we no longer vary the problem size but chose a fixed matrix size of 214 x 214 elements to measure the sustained bandwidth. Instead of the number of transposed elements, the horizontal axis now shows the tile width respectively the grainsize. The shape of a tile is still quadratic.

It can be seen that non vectorized versions of OpenMP and TBB are very sensitive with regards to the chosen tile size. The curve of the OpenMP tiled version drops very fast to a minimal performance when exceeding a tile width of 128. The cache-oblivious TBB version has a similar behavior but it has a plateau between a tile width of 130 and 240 during which it can maintain a mediocre performance before it further drops to its minimum. By increasing the minimum tile size of the cache-oblivious algorithm we limit its self-adapting behavior so it can no longer use the full cache hierarchy but has to use bigger and slower cache levels. For both of these implementations, we decided on a rather conservative tile width of 64 to be optimal as it yields almost the best performance and is still quite far away from a tile width where small changes in size cause severe penalties.

When looking at the vectorized implementations there are also drops in performance but these are much smaller than the ones in the non vectorized versions. The quite consistent bandwidth is not a result of using vector instructions in general but of the associated memory access pattern. Because our vectorized function transposes blocks of 8x8 elements we actually introduced a second layer of tiling as we move with the 8x8 subtiles across the original tiles. Thus the bandwidth stays quite high even after exceeding a tile width of 128.

What was still surprising for us was the extremely consistent performance of the vectorized TBB algorithm. It seems like the combination of the cache-oblivious algorithm with the second layer of tiling by the vector instructions creates a very robust algorithm which is almost independent of the minimum tile size. For our vectorized versions, we chose a tile width of 128 to be optimal.

Rome

Test Setting:
In the next test case, we keep the input data set identical. We still perform a matrix transposition using square matrices starting from a matrix width of 25 up to a width of 215 with single-precision floating-point values. We also compare our different implementations against each other but the significant change is that we use completely different hardware.

We switch from the Media node to the Rome node which has a NUMA architecture and features an AMD EPYC 7662 processor with 64 physical and 128 logical cores. It has 32KB L1 cache and 512KB L2 cache for each core and also 16 x 16MB L3 cache.

Due to the high core count, the CPU can leverage much more concurrency for performance however the NUMA architecture is an additional component to consider in possible optimizations. To get close to the new peak memory performance of 98.9 GB/s NUMA aware data placement might be essential.

At first, we want to look again at the STL implementation of our algorithm where we use the STL for_each() function. What we can see on Rome is that there is actually a break-even point where the STL algorithms with the parallel execution policy are faster than the serial version. On the Rome machine, the higher core count seems to have a positive impact on the parallel STL versions compared to the results of the benchmarks on the Media machine. However, it is just slightly faster and we are at only 1% of the system's peak performance. Therefore we decided to not include these implementations in the upcoming plots anymore.

drawing

Next, we want to try the configuration we used for our best results on the media node. The achieved performance can be seen in the following plot.

drawing

We kept the dynamic scheduling of OpenMP, the simple_partitioner for TBB as well as the tile_width of 128 for both of our AVX implementations and 64 for the other tiled versions. But as we can see the results are far from peak performance. We get these bad results because we don't initialize our data with respect to the NUMA architecture but this is one of the major differences between the two machines and must be considered. Currently, our data is initialized by a single thread and resides only in a single NUMA node. Therefore we observe high access latencies when a thread from another NUMA region accesses the data.

Rome NUMA Aware

As we could see, an implementation without NUMA-aware data placement results in bad performance so our first optimization on Rome tackles exactly this problem. Now we initialize our data with multiple threads and in the exact same pattern as we execute the transposition later on. With regards to the first-touch policy we want to initialize the data in the NUMA node where we process the data afterwards. Additionally, we introduce the use of the OpenMP Places environment variables that allow us the pinning of threads to physical cores. We got the best results by using OMP_PLACES=cores with OMP_PROC_BIND=true. Additionally, we switch from dynamic scheduling to static scheduling to get a static work assignment to threads. For TBB we can achieve the same behavior with the static partitioner but this has the disadvantage of losing control over the actual tile width as TBB uses the specified grainsize only as a lower bound when using the static partitioner. This leads to bad performance as we lose the improved cache usage of our tiling approach because the tiles don't fit into cache anymore (see TBB).

Hence we have to stick to the simple partitioner to keep control over the minimum tile width in our cache-oblivious solution. The only optimization we do for TBB is that we initialize our data also in parallel and in the same pattern as we process it later on so it's spread across all NUMA-nodes. However with TBBs work-stealing scheduling we cannot guarantee that the same thread that initialized the data also processes the data. But still, this optimization increases performance significantly.

A performance plot when using the static_partitioner can be seen in Appendix B

The results when using the simple partitioner and static scheduling for OpenMP together with OpenMP Places and a tile width of 64 can be seen in the following picture.

drawing

We can see that the performance for the AVX versions improves a lot compared to the non-NUMA-aware version but all the other implementations are still far from peak performance. We attribute the gain in performance for our AVX versions to the additional level of tiling as we are always transposing an 8 x 8 block in our vector registers.

We assume that adjusting the tile width for the other versions without the additional layer of tiling could also improve performance.

Rome Tile Width Adjustment

For the analysis of an optimal tile width on the Rome machine we again keep the problem size constant at 215 x 215 and steadily increase the tile width. For all the other parameters we stick to the NUMA-aware configuration from the last plot.

drawing

For the AVX versions with the additional layer of tiling, we see results similar to the media machine. Varying the tile width for the bigger tile has a smaller impact on performance compared to implementations with only a single layer of tiling.

An interesting observation for these versions is that it seems like a very small tile width of 8 results in by far the best performance. For the OpenMP AVX version a tile width of 8 for the outer tile is not suitable as we would lose the second layer of tiling but keep the control flow overhead.

So just like we assumed in our previous plot, we have chosen the wrong tile width. For the single-layer tiling implementations, very small tiles are beneficial for both TBB and OpenMP. The results when using 64x64 tiles for the AVX versions and 8x8 blocks for all other versions can be seen in the following plot.

drawing

Now we see a significant performance increase also in our non-AVX-versions. Especially the cache-oblivious versions benefit from the smaller tiling and perform almost as good as the AXV versions. An adjustment of the chunksize didn't increase performance which can be seen in Appendix B.

TBB Task Arena

As we described in the section about Numa aware data placement our cache-oblivious version with TBB relies on the simple partitioner therefore we cannot ensure that each thread operates on data that is placed in its corresponding NUMA node. As an additional improvement, we implemented a version using hwloc which gives us further control over the memory system and allows us to allocate memory in each NUMA region. With TBBs task_arena we can now create a task arena for each NUMA region which allows us to compute a subproblem independently in each NUMA region. We achieve this by pinning threads to a specific NUMA region so they can only operate on data that is initialized there.

The computation on each subproblem in each NUMA node is performed in the same way as we described it in the TBB AVX section. With this step, we introduce a third layer of tiling in the algorithm. However, by splitting the matrix and distributing the computation over the available NUMA regions we no longer get a contiguous matrix but in our case 4 independent submatrices.

This no longer full fills the specified requirements but as we can see in the following plot we get the best performance with this additional optimization so we want to include these results.

drawing

We see that the hwloc version yields overall the best results for big problem sizes. It performs poorly for small matrices because of the additional overhead.

Computational Benchmark

In the previous sections, we looked at the performance of our implementation with regards to the achievable memory bandwidth as we are clearly memory bound in this application just as described in the problem analysis. Now we want to analyze also the computational performance assuming we remove our memory limitations.

So now instead of reading data from memory we analytically generate the input values and drastically reduce the memory traffic by writing only bits in the output vector. Therefore we wanted to use the STL bool vector but according to cpp-reference it "does not guarantee that different elements in the same container can be modified concurrently by different threads". This is also what we observed in our tests. Because of race conditions among threads, the results of the transposition are sometimes wrong. Additionally, the performance gets worse when we use the bool vector compared to using a float vector for the output. Actual measurements can be found in Appendix C.

Therefore we stick to float values for our output but still calculate the input values. This also comes with some restrictions as we no longer have input values in memory our AVX versions cannot fill the vector registers with a SIMD load instruction anymore. To overcome this problem we have to fill local arrays with the generated input values and load these values from there. This does come with additional overhead but as we want to compare our results to the memory benchmark we want to modify our algorithms as little as possible.

Instead of measuring memory bandwidth in GB/s we now measure giga updates per second where one update corresponds to one transposed element of our matrix. For one update in the OpenMP basic version without tiling we need on average:

  • two compare instructions for the nested loops
  • two increment instructions
  • 6 arithmetic instructions to calculate the positions of the element

According to compiler explorer, this translates to 32 assembler instructions.

The results of our measurements on the Media node with the best configuration from the memory benchmark are depicted in the following plot. Again we increased the problem size from a matrix size of 25 x 25 to a matrix size of 215 x 215. For OpenMP we use dynamic scheduling with a tile width of 64 respectively 128 for the AVX version. For TBB we still use the simple partitioner with a grainsize of 64 respectively 128 for the AVX version.

drawing

We can see that in the computational benchmark the OpenMP basic version without tiling performs best. This can be explained by the new memory access pattern. The algorithm was designed to write contiguously to the output matrix and read with a stride from the input. But as we now generate our input values and no longer read from memory we have a perfect memory access pattern without a stride. Our optimizations from the other versions especially tiling now just create unnecessary overhead in the computation. The tiling even worsens our performance as it introduces a stride in the output without the increase of locality in the input. Also, we have additional instruction overhead from tiling.

Also, we see the impact of the register fill phase for the AVX versions as they perform now significantly worse than the other parallel version.

The achieved giga updates of the basic OpenMP version corresponds to the maximum memory write performance on the Media machine.

For Rome, we see similar behavior. With the best configuration from the memory benchmark, we get the results of the following plot.

drawing

We use again the static scheduling for OpenMP and the simple partitioner for TBB with a tile width or respectively a grainsize of 8 for all parallel non-AVX versions. For the AVX versions with the second layer of tiling, we use a tile width or a grainsize of 64. Also, we do a NUMA aware data placement for the output vector and use OpenMP Places.

Here the results are very similar to the results on the Media machine. Once again the OpenMP basic version performs best and we are very close to the write peak performance of the Rome machine.

Overall we have to admit that the computational benchmark doesn't provide additional insights because we are still memory bound. Unfortunately, the bool vector doesn't change this fact. This makes it in our case also impossible to reason about the computational workload of our algorithms.

Appendix A

drawing

Media machine:

Static scheduling for OpenMP, static partitioner for TBB. Constant Matrix size of 214 x 214.

drawing

Media machine:

Static scheduling for OpenMP, static partitioner for TBB, tile width/grain size of 128 for AVX versions respectively 64 for all other versions.

Appendix B

drawing

Rome machine:

Static scheduling for OpenMP, static partitioner for TBB. Constant Matrix size of 215 x 215.

drawing

Rome machine:

Static scheduling for OpenMP, static partitioner for TBB, tile width/grain size of 64 for AVX versions respectively 8 for all other versions.

drawing

Rome machine:

Static scheduling for OpenMP, simple partitioner for TBB, tile width/grain size of 64 and chunksize of 1 for OpenMP.

Appendix C

drawing

Media machine: Using bool vector, dynamic scheduling for OpenMP, simple partitioner for TBB, tile width/ grain size of 128 for AVX versions respectively 64 for all other versions.

drawing

Rome machine: Using bool vector, Static scheduling for OpenMP, simple partitioner for TBB, tile width/grain size of 64 for AVX versions respectively 8 for all other versions.

Appendix D

drawing

drawing