diff --git a/api/docs/scatter_gather_emulation.dox b/api/docs/scatter_gather_emulation.dox
index 29263a2857b..2a6671a7e70 100644
--- a/api/docs/scatter_gather_emulation.dox
+++ b/api/docs/scatter_gather_emulation.dox
@@ -1,5 +1,6 @@
/* ******************************************************************************
* Copyright (c) 2010-2022 Google, Inc. All rights reserved.
+ * Copyright (c) 2024 Arm Limited All rights reserved.
* ******************************************************************************/
/*
@@ -32,12 +33,14 @@
/**
****************************************************************************
-\page page_scatter_gather_emulation Emulating x86 Scatter and Gather Instructions
+\page page_scatter_gather_emulation Emulating Scatter and Gather Instructions
\tableofcontents
# Background
+## x86
+
The x86 gather and scatter instructions were introduced in the AVX2 and AVX512
instruction set extensions. They allow loading or storing a subset of elements in a
vector from/to multiple non-contiguous addresses.
@@ -71,6 +74,127 @@ register `k1`. Elements may be scattered in any order. When an element is stored
its mask is cleared. If some store faults, all elements to its right (closer to
LSB) will be complete.
+## AArch64
+
+The AArch64 SVE
+(Scalable Vector Extension) introduced the first AArch64 scatter and gather
+instructions.
+
+SVE has two scatter and gather addressing modes (scalar+vector and vector+immediate)
+and two predicated contiguous load and store addressing modes (scalar+scalar and
+scalar+immediate). The SVE2 instruction set extension adds a third scatter/gather
+addressing mode: vector+scalar.
+
+The predicated contiguous instructions do not use vector-based addressing, but they
+share enough similarities with the scatter and gather instructions that DynamoRIO
+handles them in a similar way.
+
+All scatter/gather/predicated contiguous instructions access elements conditionally based
+on the mask value of a governing predicate register. Loads are always zeroing: inactive
+elements are set to 0 in the destination register.
+
+### Scalar+vector
+
+```
+ld1b (%x0,%z0.s,uxtw)[1byte] %p0/z -> %z1.s
+```
+
+Above is a scalar+vector gather load that reads 8-bit values which are zero-extended and
+written to 32-bit elements of the `z1` vector register. Addresses are calculated by
+adding the base address from the scalar `x0` register to the corresponding 32-bit element
+in the vector index register `z0`. Elements that are not active in the mask contained by
+the governing predicate register `p0` are not loaded and the corresponding element in the
+destination register `z1` is set to `0`.
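
The behaviour of the `ld1b` gather load above can be sketched in scalar C as follows.
This is only an illustration, not DynamoRIO code: the helper name `gather_ld1b_uxtw` is
hypothetical, and the predicate is simplified to one byte per 32-bit element rather than
the real SVE predicate layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DynamoRIO API): models
 *   ld1b (%x0,%z0.s,uxtw)[1byte] %p0/z -> %z1.s
 * one element at a time. active[i] != 0 means element i is active. */
static void
gather_ld1b_uxtw(const uint8_t *base, const uint32_t *index, const uint8_t *active,
                 uint32_t *dst, size_t num_elements)
{
    assert(base != NULL && index != NULL && active != NULL && dst != NULL);
    for (size_t i = 0; i < num_elements; i++) {
        if (active[i]) {
            /* uxtw: the 32-bit index element is zero-extended before the add. */
            const uint8_t *addr = base + (uint64_t)index[i];
            /* The loaded byte is zero-extended to the 32-bit element. */
            dst[i] = (uint32_t)*addr;
        } else {
            /* Zeroing predication: inactive elements are set to 0. */
            dst[i] = 0;
        }
    }
}
```

Memory is only dereferenced for active elements, so an inactive element with an invalid
index does not cause an access in this model.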
+
+```
+st1h %z17.d %p5 -> (%x19,%z20.d)[2byte]
+```
+
+Above is a scalar+vector scatter store that writes the lowest 16 bits of the 64-bit
+elements of the `z17` vector register. Addresses are calculated by adding the base
+address from the scalar `x19` register to the corresponding 64-bit element in the vector
+index register `z20`. Elements that are not active in the mask contained by the governing
+predicate register `p5` are not stored.
+
+### Vector+immediate
+
+```
+ld1sw +0x18(%z8.d)[4byte] %p2/z -> %z6.d
+```
+Above is a vector+immediate gather load that reads 32-bit values which are sign-extended
+and written to 64-bit elements of the `z6` vector register. Addresses are calculated by
+adding the immediate value `0x18` to the corresponding 64-bit element in the vector base
+register `z8`. Elements that are not active in the mask contained by the governing
+predicate register `p2` are zeroed.
+
+### Vector+scalar
+
+```
+stnt1d %z27.d %p7 -> (%z25.d,%x29)[8byte]
+```
+
+The above instruction is a vector+scalar scatter store, an addressing mode introduced
+with SVE2. It writes the 64-bit elements of the `z27` vector register. Addresses are
+calculated by adding the value of scalar register `x29` to the corresponding 64-bit
+element in the vector base register `z25`. Elements that are not active in the mask
+contained by the governing predicate register `p7` are not stored.
+
+### Scalar+scalar
+
+```
+ld1sb (%x17,%x18)[1byte] %p5/z -> %z16.d
+```
+
+The above instruction is a scalar+scalar predicated contiguous load. 8-bit values are
+loaded, sign-extended to 64 bits, and written to the vector register `z16`. The address
+of the first element is calculated by adding the value of the scalar index register `x18`
+to the value of the scalar base register `x17`. Addresses for subsequent elements are
+calculated by adding the size of the loaded value (1 byte in this example) to the address
+of the previous element. Elements that are not active in the mask contained by the
+governing predicate register `p5` are zeroed.
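
As a rough scalar model of the `ld1sb` example above, the address walk and sign extension
might look like the following. The function name `contiguous_ld1sb` and the
one-byte-per-element predicate are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DynamoRIO API): models
 *   ld1sb (%x17,%x18)[1byte] %p5/z -> %z16.d
 * active[i] != 0 means element i is active. */
static void
contiguous_ld1sb(const int8_t *base, uint64_t index, const uint8_t *active,
                 int64_t *dst, size_t num_elements)
{
    assert(base != NULL && active != NULL && dst != NULL);
    for (size_t i = 0; i < num_elements; i++) {
        if (active[i]) {
            /* The first address is base + index; each later element is one
             * loaded-value size (1 byte here) further on. */
            const int8_t *addr = base + index + i * sizeof(int8_t);
            /* The loaded byte is sign-extended to the 64-bit element. */
            dst[i] = (int64_t)*addr;
        } else {
            /* Zeroing predication: inactive elements are set to 0. */
            dst[i] = 0;
        }
    }
}
```

Unlike a gather, the address depends only on the element number, so the only per-element
input is the predicate.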
+
+### Scalar+immediate
+
+```
+st1w %z10.s %p3 -> -0x60(%x11)[4byte]
+```
+The above instruction is a scalar+immediate predicated contiguous store that writes the
+32-bit elements of the `z10` vector register. The address for the first element is
+calculated by adding the immediate value `-0x60` to the value of the scalar base register
+`x11`. Addresses for subsequent elements are calculated by adding the size of the stored
+value (4 bytes in this example) to the address of the previous element. Elements that are
+not active in the mask contained by the governing predicate register `p3` are not stored.
+
+Note that DynamoRIO IR and disassembly for scalar+immediate instructions give the offset
+in bytes, but the instruction itself uses a 4-bit signed immediate which is multiplied by
+the current SVE vector length in bytes. Arm assembly syntax uses a vector-length-agnostic
+representation for the offset: `#<imm>, MUL VL`.
+
+So the above instruction might be encoded as
+```
+st1w z10.s, p3, [x11, #-3, MUL VL]
+```
+if the current vector length is 32 bytes, or
+```
+st1w z10.s, p3, [x11, #-6, MUL VL]
+```
+if it is 16 bytes, and cannot be encoded for vector lengths > 32 bytes.
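
The relationship between the byte offset and the encoded immediate can be checked with a
small helper like the one below. The name `byte_offset_to_mul_vl_imm` is hypothetical; the
sketch only assumes what is stated above, namely that the immediate is a 4-bit signed
value (range -8 to 7) multiplied by the vector length in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DynamoRIO API): converts a byte offset to the
 * 4-bit signed "MUL VL" immediate for a given vector length in bytes.
 * Returns false if the offset cannot be encoded at that vector length. */
static bool
byte_offset_to_mul_vl_imm(int64_t byte_offset, int64_t vl_bytes, int *imm_out)
{
    assert(vl_bytes > 0 && imm_out != NULL);
    /* The offset must be a whole multiple of the vector length. */
    if (byte_offset % vl_bytes != 0)
        return false;
    int64_t imm = byte_offset / vl_bytes;
    /* A 4-bit signed immediate covers -8..7. */
    if (imm < -8 || imm > 7)
        return false;
    *imm_out = (int)imm;
    return true;
}
```

For the `-0x60` offset in the example, this yields `-3` at a 32-byte vector length and
`-6` at 16 bytes, and reports failure at 64 bytes because `-0x60` is not a multiple of 64.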
+
+### Non-faulting loads
+
+Non-faulting loads (`ldnf*`) do not raise a fault if an element read faults. Instead, a
+special predicate register, the FFR (first-fault register), is updated: the FFR element
+corresponding to the faulting element and all higher elements are set to 0. Elements
+lower than the faulting element are unchanged.
+Non-faulting loads support scalar+immediate addressing.
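
The FFR update described above can be modelled as follows. This is a simplified sketch:
`ffr_update_on_fault` is a hypothetical name, and the FFR is represented as one byte per
element instead of its real bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DynamoRIO API): applies the FFR update for a
 * suppressed fault on element faulting_element. */
static void
ffr_update_on_fault(uint8_t *ffr, size_t num_elements, size_t faulting_element)
{
    assert(ffr != NULL && faulting_element < num_elements);
    /* The faulting element and all higher elements are cleared.
     * Lower elements are left unchanged. */
    for (size_t i = faulting_element; i < num_elements; i++)
        ffr[i] = 0;
}
```

Because lower elements are left unchanged rather than set, an FFR element that was already
0 before the load stays 0.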
+
+### First-faulting loads
+
+First-faulting loads (`ldff*`) behave like a normal gather/load instruction if the first
+active element causes a fault, and behave like a non-faulting load if the first active
+element succeeds but a later active element read causes a fault.
+First-faulting loads support scalar+scalar, scalar+vector, and vector+immediate
+addressing.
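
The first-faulting rule can be expressed as a small decision function. The sketch below is
illustrative only: the names `fault_action_t` and `ldff_fault_action` are hypothetical, and
the predicate is modelled as one byte per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    FAULT_DELIVERED, /* Behaves like a normal load: the fault is raised. */
    FAULT_SUPPRESSED /* Behaves like a non-faulting load: only the FFR changes. */
} fault_action_t;

/* Hypothetical helper (not a DynamoRIO API): decides what a first-faulting
 * load does when the read of active element faulting_element faults. */
static fault_action_t
ldff_fault_action(const uint8_t *active, size_t num_elements, size_t faulting_element)
{
    assert(active != NULL && faulting_element < num_elements);
    assert(active[faulting_element]); /* Only active elements are read. */
    for (size_t i = 0; i < faulting_element; i++) {
        /* An earlier active element exists, so the faulting element is not
         * the first active one. */
        if (active[i])
            return FAULT_SUPPRESSED;
    }
    return FAULT_DELIVERED;
}
```

With a predicate of `{0, 1, 1, 0}`, a fault on element 1 is delivered because it is the
first active element, while a fault on element 2 is suppressed.
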
# Problem Statement
@@ -104,10 +228,6 @@ This required the addition of new support in various DynamoRIO components, like
drreg, drx, drmgr and core DR. Multiple contributors worked on designing and
implementing the required changes.
-Note that we expect the same approach to work for other platforms too, like for the
-AArch64 SVE scatter/gather instructions.
-
-
## Scatter/gather Instruction Expansion
Owner: [Hendrik Greving](https://github.com/hgreving2304)
@@ -120,8 +240,13 @@ num_accesses = vector_size / element_size
for i = 0, 1, ..., (num_accesses-1), do
extract mask for the ith access from mask reg or mask vector
if mask is set, then
- extract ith element of index vector
- compute address = base + ith index element
+ if index is vector, then
+ extract ith element of index vector
+ compute address = base + ith index element
+ else // base is vector
+ extract ith element of base vector
+ compute address = ith base element + index
+ done
if instr_is_gather, then
load data from address into a scalar reg
insert scalar data into destination vector
@@ -129,7 +254,9 @@ for i = 0, 1, ..., (num_accesses-1), do
extract scalar data from source vector to scalar reg
store data from scalar reg to address
done
- clear ith mask in mask reg or mask vector
+ if x86, then
+ clear ith mask in mask reg or mask vector
+ done
done
done
```
@@ -151,7 +278,7 @@ any client that needs it, including drcachesim. This support was added by
As an example, the following are the expansions of some instructions.
-Expansion for
+Expansion for x86 gather
```
vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
```
@@ -258,7 +385,7 @@ vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
```
-Expansion for
+Expansion for x86 scatter
```
vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
```
@@ -376,9 +503,192 @@ vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
+500 m4 @0x00007fdb2ac0baa0