-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM Backend using ruy for fp32 and int8 #79
Conversation
Provides an arm backend for matrix multiplies using google/ruy and math functions through simde (https://simd-everywhere.github.io/blog/about/) effectively getting marian-decoder to run on ARM. The following cmake flags are added: - USE_INTGEMM (to switch intgemm on/off) - USE_RUY (to switch ruy on/off) - USE_ONNX_SGEMM (use onnx sgemm added by wasm to provide attention matrix multiply which is currently reliant on a BLAS library). - USE_SIMDE (swaps out existing intel based functions by using SIMDE instead). The built marian-decoder is tested on an Oracle Cloud ARM Machine with the following specs: Architecture : aarch64 CPU op-mode(s) : 32-bit, 64-bit Byte Order : Little Endian Vendor ID : ARM Model name : Neoverse-N1 Flags : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs A CI check on GitHub actions is added to use android-ndk cross-compile targetting arm64-v8a. The built binary is tested to work on an Android Phone using termux (Samsung M30s). Successful android build additionally requires a patch (sentencepiece -> protobuf). See opencv/opencv#17282 and opencv/opencv#19049. -Werror etc causes issues with ruy (-Wmulti-line-comment) and are disabled. The following minor changes are also applied: - Remove M32_BINARIES use COMPILE_WASM for -m32 - Hide msse4.1 if unknown platform - faiss was previously hardcoded for platforms with SSE available. This has been mitigated by adding a refernce standard cpp implementation of the missing function. - Exclude packed_gemm_....cpp from sources if USE_FBGEMM=off - MSVC workaround following browsermt#56 (comment)
The cmake-flags, BLAS marian-nmt#762 looks like a different approach using the units here will be undertaken for marian-nmt/marian-dev. Requesting @XapaJIaMnu, @kpu, @graemenail for feedback on how to simplify the situation. I'm hoping devs who have awareness of the bigger picture can point me towards the appropriate things to do. I have gotten this to work on Oracle Machines and Android - I expect to have access to an M1 Macbook Air in May. |
Don't use the ONNX GEMM. It should be using ruy. Since ruy has We would then have a library that does |
ONNX SGEMM is used only by WebAssembly and Android (for now). On Mac M1 Apple Accelerate or a BLAS library is available. OpenBLAS is used for time being in Oracle cloud ARM machine. I have isolated enough to get sgemm through ruy in https://github.com/jerinphilip/sgemm/blob/28dc786d821d3abb2acf086e1fb145e58cc55372/src/main.cpp#L54-L75 for now. This looks like it will take longer - the optimized primitive I find in ruy is If available, I'd like to focus on integration here (without adding more source complicating review) and bring in a PR behind finding for ruy based sgemm for Android.
|
Ruy provides a bias term here: Regarding layout (transpose and stride arguments), these are supported: https://github.com/google/ruy/blob/2d950b3bfa7ebfbe7a97ecb44b1cc4da5ac1d6f0/example/parametrized_example.cc#L84-L91 There does not appear to be support for And with that you would have the features of sgemm. Eigen's performance on x86 has been bad, not sure about ARM. |
Maybe I'm missing something, what do we need bias term for? We multiply The layout-based transpose is available - looks like https://github.com/google/ruy/blob/8c3fd3f266b4a22d542d4aa41329b5018d6b87e1/ruy/test.h#L1157-L1164. I am currently looking into using this.
Awaiting feedback on the remaining parts meanwhile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, use ruy for fp32 and https://github.com/JishinMaster/simd_utils/ for the functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes to self.
@@ -218,11 +218,13 @@ class ExpressionGraphPackable : public ExpressionGraph { | |||
cols(val)); | |||
//Put the quantMult at the back of the tensor | |||
*(reinterpret_cast<float *>(paramMat->data<int8_t>() + val->shape().elements())) = quantMult; | |||
#else | |||
ABORT("Int8::PrepareA not implemented yet for ruy"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compile error moved to runtime, but haven't managed to trigger it in runs. I wonder why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Ruy PrepareA is the same as PrepareB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some partial comments, from my (im)partial review.
src/tensors/cpu/ruy_adapter.h
Outdated
return result; | ||
} | ||
|
||
static void PrepareBias(const float *input, float *output, Index rows, Index cols) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using marian tensors here, as they allow for zero copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more precise, when being inside the marian tensor ecosystem, you can choose to return the same unmodified tensor from prepareBias (identity operation that just returns the input tensor) and avoid the copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ruy_adapter.h
sits at the same level as intgemm
. Tensor
in arguments is therefore unsuitable. Marian provides Tensor
. Marian uses ruy_adapter
/ intgemm
. My preferred solution to this is to capture the variations as callback/first-class functions and do f(args) to not duplicate code. These primitives should be supplied by this file/layer, without pulling in marian::Tensor
.
Edit: callback could be unquantize when we don't add bias, and unquantizeAddBias when we have bias, switch can happen at call-site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RUY backend should NEVER use prepareBias. Calling this function from insude ARM should result in an Abort and is definitely a programmer mistake. ARM supports proper int8_t * int8_t
and therefore doesn't need the prepare bias mumbo jumbo which we do in intgemm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RUY backend should NEVER use prepareBias.
I agree. I understand the objective here is to avoid the std::memcpy
which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that this function should never be called from ruy as it will be slower than the other code path. It should abort. The "shifted" code path shouldn't be taken by this backend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed on today's call, since ARM has 8-bit signed * signed, the relevant code path to follow is int8 (not int8shifted).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpu suggested lifting ruy path from intgemm_interface.h
and adding at a level of gemm-provider. Much of the existing state is due to coming over from https://github.com/jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface
to complete the calls (i.e duplication). Please let me know if it's a go on this one.
I agree. I understand the objective here is to avoid the std::memcpy which can be done by changing the calling code. I will try to find a solution that achieves the intent at integration.
The std::memcpy
was probably useless, got rid of it and a bunch of nullptr
in the args in the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpu suggested lifting ruy path from intgemm_interface.h and adding at a level of gemm-provider. Much of the existing state is due to coming over from jerinphilip/MozIntGemm. This will involve writing some equivalent of intgemm_interface to complete the calls (i.e duplication). Please let me know if it's a go on this one.
Without documentation (for intgemm_interface.h
, the integration to marian's graph system) or active help from someone who understands the area, I'm afraid I will not be able to do this in the immediate future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have read https://github.com/browsermt/marian-dev/blob/master/src/tensors/cpu/intgemm_interface.h and determined that, while it could use some comments, it is in state where any software engineer working for me is expected to read though and understand or come up with questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small review
|
||
target_architecture(CMAKE_TARGET_ARCHITECTURES) | ||
list(LENGTH CMAKE_TARGET_ARCHITECTURES cmake_target_arch_len) | ||
if(NOT "${cmake_target_arch_len}" STREQUAL "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean, sorry? Is this 32bit vs 64bit? A small clarifying comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this catch the unknown
arch condition, and is that desirable?
@@ -218,11 +218,13 @@ class ExpressionGraphPackable : public ExpressionGraph { | |||
cols(val)); | |||
//Put the quantMult at the back of the tensor | |||
*(reinterpret_cast<float *>(paramMat->data<int8_t>() + val->shape().elements())) = quantMult; | |||
#else | |||
ABORT("Int8::PrepareA not implemented yet for ruy"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Ruy PrepareA is the same as PrepareB
|
||
ruy::Matrix<float> lhs; | ||
ruy::MakeSimpleLayout(M, K, orderA, lhs.mutable_layout()); | ||
lhs.set_data(A); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is zero copy, right, just pointer set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean this statement, yes. It's a pointer set. However, I think ruy might be allocating/deallocating under the hood based on its needs of a layout change? https://github.com/google/ruy/blob/38a9266b832767a3f535a74a9e0cf39f7892e594/ruy/prepare_packed_matrices.cc#L69-L92
} | ||
} | ||
|
||
struct UnquantizeAndAddBiasAndWrite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obligatory comment that it is unnecessary to code it in this way, but we have already covered that.
Preprocess<kHighestPath>::quantize(input, output, quant_mult, rows, cols); | ||
} | ||
|
||
static void SelectColumnsB(const Type *input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jerinphilip did you try vanilla index_select?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments while I catch up on the full discussion
CMakeLists.txt
Outdated
@@ -266,7 +308,8 @@ else(MSVC) | |||
set(DISABLE_GLOBALLY "-Wno-unused-result ${CLANG_IGNORE_UNKNOWN_CUDA} ${CLANG_IGNORE_UNUSED_VALUES}") # This needs to appear here as well to appease clang11+ on linux | |||
|
|||
# These are used in src/CMakeLists.txt on a per-target basis | |||
list(APPEND ALL_WARNINGS -Wall; -Werror; -Wextra; -Wno-unused-result; -Wno-deprecated; | |||
list(APPEND ALL_WARNINGS -Wall; # -Werror; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What warning was introduced that made removal of -Werror
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Ruy has a lot of things that fire on
-Werror
, this is what made me remove it. jerinphilip/marian@e1c3f7a
(#4) - simd_utils has strict aliases/type-punning warnings which become errors.
Do I do https://stackoverflow.com/a/3308675/4565794 to get around this? I can break it and get around it by something with the headers, I hope. Upstream appears to want to keep -Werror
(marian-nmt#598 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep -Werror
and the solution you linked is acceptable as it documents exactly which warnings have been disabled and ties them to a particular header. To me, this satisfies the "OK to disable warnings in 3rd-party code after checking them once" from upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record: https://godbolt.org/z/6Mzhc1Tqq
|
||
target_architecture(CMAKE_TARGET_ARCHITECTURES) | ||
list(LENGTH CMAKE_TARGET_ARCHITECTURES cmake_target_arch_len) | ||
if(NOT "${cmake_target_arch_len}" STREQUAL "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this catch the unknown
arch condition, and is that desirable?
While Android builds appear happy with What remains is |
This reverts commit 4b80399.
There's a The implications of the changes in CMake in putting RUY_SGEMM above openblas/cblas need to be understood. BLAS_FOUND is used in LSH, node inits, unit tests; we need to determine what BLAS features are required there. |
@@ -10,11 +10,9 @@ | |||
#if MKL_FOUND | |||
#include <mkl.h> | |||
#elif BLAS_FOUND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need SGEMM here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost everything here is handled in prod_blas.h
, then there's the MKL batched gemm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant we don't need a full BLAS
flag, we need only an SGEMM
implementation. faiss (LSH) blas found is okay, because it uses a lot of BLAS (in some sense). The switch here can be narrowed to something GEMM
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can get rid of the includes here, leaving only
#if MKL_FOUND
#include <mkl.h>
#endif
since we directly call MKL functions here.
Agreed?
To update the others, this is isolated into a GCC bug(?). Clang (which is cross compiling for android) does not complain. A minimum reproducible example is https://godbolt.org/z/6Mzhc1Tqq, already linked in review-comments. Clang works fine on Oracle ARM machine. |
Provides an arm backend for matrix multiplies using google/ruy and math
functions through simde (https://simd-everywhere.github.io/blog/about/)
effectively getting marian-decoder to run on ARM.
The following cmake flags are added:
matrix multiply which is currently reliant on a BLAS library). This previously
used to be WASM_COMPATIBLE_BLAS.
instead).
The built marian-decoder is tested on an Oracle Cloud ARM Machine with
the following specs:
A CI check on GitHub actions is added to use android-ndk cross-compile
targetting arm64-v8a. The built binary is tested to work on an Android
Phone using termux (Samsung M30s).
Successful android build additionally requires a patch (sentencepiece ->
protobuf). See opencv/opencv#17282 and
opencv/opencv#19049.
-Werror etc causes issues with ruy (-Wmulti-line-comment) and are
disabled.
The following minor changes are also applied:
has been mitigated by adding a refernce standard cpp implementation of
the missing function.
Import matrix-multiply from a separate wasm module #56 (comment)
Status
intgemm_interface.h
.