From 3e1ec2fc979801ebec6b95ccf7ef17dca15e733c Mon Sep 17 00:00:00 2001
From: Jack Gallagher
Date: Thu, 2 May 2024 14:16:31 +0100
Subject: [PATCH] i#5036 Update scatter/gather docs with AArch64 details (#6795)

Adds details of the AArch64 scatter/gather expansion to the scatter/gather
expansion developer documentation.

Issue: #5036
---
 api/docs/scatter_gather_emulation.dox | 334 +++++++++++++++++++++++++-
 1 file changed, 322 insertions(+), 12 deletions(-)

diff --git a/api/docs/scatter_gather_emulation.dox b/api/docs/scatter_gather_emulation.dox
index 29263a2857b..2a6671a7e70 100644
--- a/api/docs/scatter_gather_emulation.dox
+++ b/api/docs/scatter_gather_emulation.dox
@@ -1,5 +1,6 @@
 /* ******************************************************************************
  * Copyright (c) 2010-2022 Google, Inc. All rights reserved.
+ * Copyright (c) 2024 Arm Limited All rights reserved.
  * ******************************************************************************/

 /*
@@ -32,12 +33,14 @@

 /**
  ****************************************************************************
-\page page_scatter_gather_emulation Emulating x86 Scatter and Gather Instructions
+\page page_scatter_gather_emulation Emulating Scatter and Gather Instructions
 \tableofcontents

 # Background

+## x86
+
 The x86 gather and scatter instructions were introduced in the AVX2 and AVX512
 instruction set extensions. They allow loading or storing a subset of elements
 in a vector from/to multiple non-contiguous addresses.
@@ -71,6 +74,127 @@ register `k1`. Elements may be scattered in any order. When an element is stored
 its mask is cleared. If some store faults, all elements to its right (closer to
 LSB) will be complete.

+## AArch64
+
+The AArch64 SVE
+(Scalable Vector Extension) introduced the first AArch64 scatter and gather
+instructions.
+
+SVE has two scatter and gather addressing modes (scalar+vector and vector+immediate)
+and two predicated contiguous load and store addressing modes (scalar+scalar and
+scalar+immediate). The SVE2 instruction set extension adds a third scatter/gather
+addressing mode: vector+scalar.
+
+The predicated contiguous instructions do not use vector-based addressing but they do
+have other similarities with the scatter and gather instructions which mean DynamoRIO
+handles them in a similar way.
+
+All scatter/gather/predicated contiguous instructions access elements conditionally based
+on the mask value of a governing predicate register. Loads are always zeroing: inactive
+elements are set to 0 in the destination register.
+
+### Scalar+vector
+
+```
+ld1b (%x0,%z0.s,uxtw)[1byte] %p0/z -> %z1.s
+```
+
+Above is a scalar+vector gather load that reads 8-bit values which are zero-extended and
+written to 32-bit elements of the `z1` vector register. Addresses are calculated by
+adding the base address from the scalar `x0` register to the corresponding 32-bit element
+in the vector index register `z0`. Elements that are not active in the mask contained by
+the governing predicate register `p0` are not loaded and the corresponding element in the
+destination register `z1` is set to `0`.
+
+```
+st1h %z17.d %p5 -> (%x19,%z20.d)[2byte]
+```
+
+Above is a scalar+vector scatter store that writes the lowest 16 bits of the 64-bit
+elements of the `z17` vector register. Addresses are calculated by adding the base
+address from the scalar `x19` register to the corresponding 64-bit element in the vector
+index register `z20`. Elements that are not active in the mask contained by the governing
+predicate register `p5` are not stored.
+
+### Vector+immediate
+
+```
+ld1sw +0x18(%z8.d)[4byte] %p2/z -> %z6.d
+```
+Above is a vector+immediate gather load that reads 32-bit values which are sign-extended
+and written to 64-bit elements of the `z6` vector register.
+Addresses are calculated by adding the immediate value 0x18 to the corresponding 64-bit
+element in the vector base register `z8`. Elements that are not active in the mask
+contained by the governing predicate register `p2` are zeroed.
+
+### Vector+scalar
+
+```
+stnt1d %z27.d %p7 -> (%z25.d,%x29)[8byte]
+```
+
+Introduced with SVE2. The above instruction is a scatter store that writes the 64-bit
+elements of the `z27` vector register. Addresses are calculated by adding the value of
+scalar register `x29` to the corresponding 64-bit element in the vector base register
+`z25`. Elements that are not active in the mask contained by the governing predicate
+register `p7` are not stored.
+
+### Scalar+scalar
+
+```
+ld1sb (%x17,%x18)[1byte] %p5/z -> %z16.d
+```
+
+The above instruction is a scalar+scalar predicated contiguous load. 8-bit values are
+loaded and sign-extended to 64 bits and written to the vector register `z16`. The address
+of the first element is calculated by adding the value of the scalar index register `x18`
+to the value of the scalar base register `x17`. Addresses for subsequent elements are
+calculated by adding the size of the loaded value (1 byte in this example) to the address
+of the previous element. Elements that are not active in the mask contained by the
+governing predicate register `p5` are zeroed.
+
+### Scalar+immediate
+
+```
+st1w %z10.s %p3 -> -0x60(%x11)[4byte]
+```
+The above instruction is a scalar+immediate predicated contiguous store that writes the
+32-bit elements of the `z10` vector register. The address for the first element is
+calculated by adding the immediate value `-0x60` to the value of the scalar base register
+`x11`. Addresses for subsequent elements are calculated by adding the size of the stored
+value (4 bytes in this example) to the address of the previous element. Elements that are
+not active in the mask contained by the governing predicate register `p3` are not stored.
+
+Note that DynamoRIO IR and disassembly for scalar+immediate instructions give the offset
+in bytes, but the instruction itself uses a 4-bit signed immediate which is multiplied by
+the current SVE vector length in bytes. Arm assembly syntax uses a vector length agnostic
+representation for the offset: `#<imm>, MUL VL`
+
+So the above instruction might be encoded as
+```
+st1w z10.s, p3, [x11, #-3, MUL VL]
+```
+if the current vector length is 32 bytes, or
+```
+st1w z10.s, p3, [x11, #-6, MUL VL]
+```
+if it is 16 bytes, and cannot be encoded for vector lengths > 32 bytes.
+
+### Non-faulting loads
+
+Non-faulting loads (`ldnf*`) do not fault if an element read faults. Instead, the special
+predicate register FFR is updated: the FFR element corresponding to the element that
+faulted, and all elements higher than it, are set to 0. Elements lower than the faulting
+element are unchanged.
+Non-faulting loads support scalar+immediate addressing.
+
+### First-faulting loads
+
+First-faulting loads (`ldff*`) behave like a normal gather/load instruction if the first
+active element causes a fault, and behave like a non-faulting load if the first active
+element succeeds but a later active element read causes a fault.
+First-faulting loads support scalar+scalar, scalar+vector, and vector+immediate
+addressing.

 # Problem Statement

@@ -104,10 +228,6 @@ This required the addition of new support in various DynamoRIO components, like
 drreg, drx, drmgr and core DR. Multiple contributors worked on designing and
 implementing the required changes.

-Note that we expect the same approach to work for other platforms too, like for the
-AArch64 SVE scatter/gather instructions.
-
-
 ## Scatter/gather Instruction Expansion

 Owner: [Hendrik Greving](https://github.com/hgreving2304)
@@ -120,8 +240,13 @@ num_accesses = vector_size / element_size
 for i = 0, 1, ..., (num_accesses-1), do
   extract mask for the ith access from mask reg or mask vector
   if mask is set, then
-    extract ith element of index vector
-    compute address = base + ith index element
+    if index is vector, then
+      extract ith element of index vector
+      compute address = base + ith index element
+    else // base is vector
+      extract ith element of base vector
+      compute address = ith base element + index
+    done
     if instr_is_gather, then
       load data from address into a scalar reg
       insert scalar data into destination vector
@@ -129,7 +254,9 @@ for i = 0, 1, ..., (num_accesses-1), do
       extract scalar data from source vector to scalar reg
       store data from scalar reg to address
     done
-    clear ith mask in mask reg or mask vector
+    if x86, then
+      clear ith mask in mask reg or mask vector
+    done
   done
 done
 ```
@@ -151,7 +278,7 @@ any client that needs it, including drcachesim. This support was added by

 As an example, the following are the expansions of some instructions.

-Expansion for
+Expansion for x86 gather
 ```
 vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
 ```
@@ -258,7 +385,7 @@ vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
 ```


-Expansion for
+Expansion for x86 scatter
 ```
 vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
 ```
@@ -376,9 +503,192 @@ vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
  +500 m4 @0x00007fdb2ac0baa0