-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ggml: GGML_NATIVE uses -mcpu=native on ARM #10752
ggml: GGML_NATIVE uses -mcpu=native on ARM #10752
Conversation
if (NOT USER_PROVIDED_MARCH) | ||
list(APPEND ARCH_FLAGS "-march=native") | ||
endif() | ||
else() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting all the old code in a else
might be too drastic but I guess the other cases are only relevant when cross-compiling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the other code in fact seems to be doing the same thing that -march=native
would do. GGML_NATIVE
disabled should generate a consistent build depending on the flags specified during compilation, which is not the case at the moment.
This needs to be completely revamped, and as it is, this PR is just adding to the mess that will need to be cleaned up later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can help with revamping but I need some clarification first :)
Today, building on generic ARM gives very poor performance because the build system completely ignores the GGML_NATIVE
directive, so I just aligned the current code with the current description of GGML_NATIVE
found in ggml/CMakeLists.txt
:
option(GGML_NATIVE "ggml: enable -march=native flag" ${GGML_NATIVE_DEFAULT})
I completely agree that it's not the best way to get performance but it's better than nothing and it already fixes many modern setups.
So, do we want to relax the definition of GGML_NATIVE
and allow to use, for example, -mcpu=native
on ARM which would be much better for performance?
The old code was clearly aimed at small devices, like android and raspberry and also used CMAKE_SYSTEM_PROCESSOR
so for me it wasn't used as a way to fix -march=native
at all but rather as a way to find acceptable flags when cross-compiling and in this case you really don't want GGML_NATIVE
(hence the move to else
).
Maybe @ggerganov has some memories to share about that ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe these flags were mostly set by trial and error, back when we were running whisper.cpp
on some raspberries. But this is very likely wrong as I didn't really understand the specifics and should be revamped. I'm not really an expert and I still get quite confused with all the different Arm architectures, so whatever you think makes sense to improve this is welcome. I can test changes on the entire spectrum of Apple Silicon if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, with gcc/clang it is enough to set the correct architecture flags with -march
, and -match=native
should work in the same way as x86. The exception is likely to be MSVC once again, because it does not set the preprocessor definitions of the enabled ARM features. In which case, we may consider just dropping support for MSVC with ARM entirely, because it is a constant source of problems, doesn't work with the inline asm kernels, and doesn't really add anything beyond clang or possibly MINGW.
I believe this should work:
- Set
-march=native
ifGGML_NATIVE
is enabled - Add a parameter
GGML_CPU_ARCH
to the build to set the architecture, so that ifGGML_NATIVE
is disabled and this parameter is provided, then-march=${GGML_CPU_ARCH}
is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On ARM, to build for local use (i.e. using GGML_NATIVE
), -mcpu=native
alone should be the best option as far as I know. The -march=native
will often miss some opportunities. The -mtune=native
will optimize for the current microarchitecture (so still not fully optimized for the cpu).
So I think redefining GGML_NATIVE
to something like "Try to optimize builds for the current cpu" and using -march=native
on x86_64
and -mcpu=native
on ARM
would already be much simpler and an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this sounds good to you, I can adapt this PR in this direction so we can see how it works in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds good.
This CI/CD error is not clear to me:
My M3 laptop passes all tests:
If anyone has any ideas on what I could do to reproduce the error and try to fix it, I'm interested :) |
195499d
to
806755f
Compare
I have no idea what's going on with the ARM build flags. Why is it handled differently for Apple than for other ARM platforms? It would be great if all of this could be cleaned up and unified into a single branch for ARM in the build. |
You can ignore this error. |
I totally agree the ARM section could be much simpler, here I'm just fixing |
If I recall correctly, |
My experience is that Here I fix the |
69361bf
to
3215005
Compare
3215005
to
3fdfeb1
Compare
I have updated the PR to use Apple M3:
AWS graviton4 (ubuntu):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, M1 Pro now also works correctly:
-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native
This is good, but the |
Totally agree but I need to find some devices to test before touching this code. In the meantime, I don't think the current code brings any serious regression:
So I think the risk is fairly low in the current state and we fix some issues. |
The problem with the |
It tries to do the same thing but from a different system than the local one which is why it is the only place where Another possibility is to remove the code completely, as |
Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
7955eb1
to
1dae1d8
Compare
Naively building on graviton4 with and without this PR:
|
|
Hm, you are correct - it does not detect the MMI8 feature neither on M2 Ultra nor M4 Max: 20:32:21 ▶ master ▶ 81⎘ ▶ $ ▶ git-pr 10752
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 13 (delta 8), reused 8 (delta 8), pack-reused 3 (from 1)
Unpacking objects: 100% (13/13), 7.22 KiB | 616.00 KiB/s, done.
From https://github.com/ggerganov/llama.cpp
* [new ref] refs/pull/10752/head -> pr/10752
Switched to branch 'pr/10752'
ggerganov ▶ gg-studio ▶ ~/development/github/llama.cpp ▶
20:32:32 ▶ pr/10752 ▶ 81⎘ ▶ cmake -B build-arm
-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /opt/homebrew/bin/git (found version "2.41.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: arm64
-- Including CPU backend
-- Accelerate framework found
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:53 (message):
OpenMP not found
Call Stack (most recent call first):
ggml/src/CMakeLists.txt:298 (ggml_add_cpu_backend_variant_impl)
-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes:
-- Including BLAS backend
-- Metal framework found
-- The ASM compiler identification is AppleClang
-- Found assembler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Including METAL backend
-- Configuring done (1.1s)
-- Generating done (0.4s) |
I am continuing this in #10890. |
When building with
cmake
locally on a generic ARM (a platform not explicitly handled like Apple) theGGML_NATIVE
doesn't work.Before:
After: