
ggml: GGML_NATIVE uses -mcpu=native on ARM #10752

Conversation

angt (Contributor) commented Dec 10, 2024

When building with cmake locally on a generic ARM platform (one not explicitly handled, unlike Apple), GGML_NATIVE has no effect.
Before:

$ rm -rf build && cmake -B build | grep ggml-cpu
-- Adding CPU backend variant ggml-cpu:  

After:

$ rm -rf build && cmake -B build | grep ggml-cpu
-- Adding CPU backend variant ggml-cpu: -march=native

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 10, 2024
if (NOT USER_PROVIDED_MARCH)
    list(APPEND ARCH_FLAGS "-march=native")
endif()
else()
angt (Contributor Author) commented Dec 10, 2024

Putting all the old code in an else might be too drastic, but I guess the other cases are only relevant when cross-compiling.
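
In other words, the structure under discussion looks roughly like this (a simplified sketch, not the exact diff; ARCH_FLAGS is the flag list ggml's CPU backend builds up):

if (GGML_NATIVE)
    # local build: let the compiler target the host CPU directly
    list(APPEND ARCH_FLAGS "-march=native")
else()
    # the pre-existing per-platform flag detection, mainly relevant
    # when cross-compiling (e.g. for Android or Raspberry Pi)
endif()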

Collaborator:

Yes, the other code in fact seems to be doing the same thing that -march=native would do. With GGML_NATIVE disabled, the build should be consistent and depend only on the flags specified at compile time, which is not the case at the moment.

This needs to be completely revamped, and as it is, this PR is just adding to the mess that will need to be cleaned up later.

angt (Contributor Author):

I can help with revamping but I need some clarification first :)

Today, building on generic ARM gives very poor performance because the build system completely ignores the GGML_NATIVE directive, so I just aligned the current code with the current description of GGML_NATIVE in ggml/CMakeLists.txt:

option(GGML_NATIVE "ggml: enable -march=native flag" ${GGML_NATIVE_DEFAULT})

I completely agree that it's not the best way to get performance, but it's better than nothing and it already fixes many modern setups.

So, do we want to relax the definition of GGML_NATIVE and allow using, for example, -mcpu=native on ARM, which would be much better for performance?

The old code was clearly aimed at small devices like Android and Raspberry Pi, and it also used CMAKE_SYSTEM_PROCESSOR; to me it wasn't a way to fix -march=native at all, but rather a way to find acceptable flags when cross-compiling, and in that case you really don't want GGML_NATIVE (hence the move to else).

Maybe @ggerganov has some memories to share about that?

ggerganov (Owner):

Yes, I believe these flags were mostly set by trial and error, back when we were running whisper.cpp on some Raspberry Pis. But this is very likely wrong, as I didn't really understand the specifics, and it should be revamped. I'm not really an expert and I still get quite confused by all the different Arm architectures, so whatever you think makes sense to improve this is welcome. I can test changes on the entire spectrum of Apple Silicon if necessary.

Collaborator:

If I understand correctly, with gcc/clang it is enough to set the correct architecture flags with -march, and -march=native should work in the same way as on x86. The exception is likely to be MSVC once again, because it does not set the preprocessor definitions for the enabled ARM features. In that case, we may consider just dropping support for MSVC on ARM entirely, because it is a constant source of problems, doesn't work with the inline asm kernels, and doesn't really add anything beyond clang or possibly MinGW.

I believe this should work:

  • Set -march=native if GGML_NATIVE is enabled
  • Add a parameter GGML_CPU_ARCH to the build to set the architecture, so that if GGML_NATIVE is disabled and this parameter is provided, -march=${GGML_CPU_ARCH} is used (see the sketch below).
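
A sketch of how that could look (GGML_CPU_ARCH is the name proposed in the bullet above, not an existing ggml option; the wiring is illustrative):

set(GGML_CPU_ARCH "" CACHE STRING "ggml: architecture passed to -march when GGML_NATIVE is off")

if (GGML_NATIVE)
    list(APPEND ARCH_FLAGS "-march=native")
elseif (GGML_CPU_ARCH)
    # reproducible builds: the user names the target explicitly,
    # e.g. cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARCH=armv8.2-a+dotprod+i8mm
    list(APPEND ARCH_FLAGS "-march=${GGML_CPU_ARCH}")
endif()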

angt (Contributor Author):

On ARM, when building for local use (i.e. with GGML_NATIVE), -mcpu=native alone should be the best option as far as I know. -march=native will often miss some opportunities, and -mtune=native only tunes for the current microarchitecture (so the result is still not fully optimized for the CPU).

So I think redefining GGML_NATIVE to something like "try to optimize the build for the current CPU", using -march=native on x86_64 and -mcpu=native on ARM, would already be much simpler and an improvement.
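
In CMake terms, the redefinition would be roughly this (a sketch, assuming the ARM case is detected via CMAKE_SYSTEM_PROCESSOR as elsewhere in the build):

if (GGML_NATIVE)
    if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64|ARM64)$")
        list(APPEND ARCH_FLAGS "-mcpu=native")   # ARM: picks both architecture and tuning for this CPU
    else()
        list(APPEND ARCH_FLAGS "-march=native")  # x86_64 and others
    endif()
endif()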

angt (Contributor Author):

If this sounds good to you, I can adapt this PR in this direction so we can see how it works in practice.

Collaborator:

Yes, sounds good.

angt commented Dec 10, 2024

This CI/CD error is not clear to me:

CPY(type_src=f32,type_dst=q5_1,ne=[256,2,3,4],permute=[0,2,1,3]): [CPY] NMSE = 0.000001799 > 0.000001000 FAIL

My M3 laptop passes all tests:

$ ctest               
Test project /Users/angt/llama.cpp/build
      Start  1: test-tokenizer-0-bert-bge
 1/30 Test  #1: test-tokenizer-0-bert-bge .........   Passed    0.04 sec
      Start  2: test-tokenizer-0-command-r
 2/30 Test  #2: test-tokenizer-0-command-r ........   Passed    0.31 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/30 Test  #3: test-tokenizer-0-deepseek-coder ...   Passed    0.06 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/30 Test  #4: test-tokenizer-0-deepseek-llm .....   Passed    0.12 sec
      Start  5: test-tokenizer-0-falcon
 5/30 Test  #5: test-tokenizer-0-falcon ...........   Passed    0.08 sec
      Start  6: test-tokenizer-0-gpt-2
 6/30 Test  #6: test-tokenizer-0-gpt-2 ............   Passed    0.06 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/30 Test  #7: test-tokenizer-0-llama-bpe ........   Passed    0.24 sec
      Start  8: test-tokenizer-0-llama-spm
 8/30 Test  #8: test-tokenizer-0-llama-spm ........   Passed    0.04 sec
      Start  9: test-tokenizer-0-mpt
 9/30 Test  #9: test-tokenizer-0-mpt ..............   Passed    0.07 sec
      Start 10: test-tokenizer-0-phi-3
10/30 Test #10: test-tokenizer-0-phi-3 ............   Passed    0.03 sec
      Start 11: test-tokenizer-0-qwen2
11/30 Test #11: test-tokenizer-0-qwen2 ............   Passed    0.16 sec
      Start 12: test-tokenizer-0-refact
12/30 Test #12: test-tokenizer-0-refact ...........   Passed    0.07 sec
      Start 13: test-tokenizer-0-starcoder
13/30 Test #13: test-tokenizer-0-starcoder ........   Passed    0.07 sec
      Start 14: test-tokenizer-1-llama-spm
14/30 Test #14: test-tokenizer-1-llama-spm ........   Passed    0.15 sec
      Start 15: test-log
15/30 Test #15: test-log ..........................   Passed    0.01 sec
      Start 16: test-arg-parser
16/30 Test #16: test-arg-parser ...................   Passed    0.02 sec
      Start 17: test-sampling
17/30 Test #17: test-sampling .....................   Passed    0.74 sec
      Start 18: test-chat-template
18/30 Test #18: test-chat-template ................   Passed    0.00 sec
      Start 19: test-grammar-parser
19/30 Test #19: test-grammar-parser ...............   Passed    0.00 sec
      Start 20: test-grammar-integration
20/30 Test #20: test-grammar-integration ..........   Passed    0.01 sec
      Start 21: test-llama-grammar
21/30 Test #21: test-llama-grammar ................   Passed    0.00 sec
      Start 22: test-backend-ops
22/30 Test #22: test-backend-ops ..................   Passed   29.79 sec
      Start 23: test-model-load-cancel
23/30 Test #23: test-model-load-cancel ............   Passed    0.24 sec
      Start 24: test-autorelease
24/30 Test #24: test-autorelease ..................   Passed    0.17 sec
      Start 25: test-barrier
25/30 Test #25: test-barrier ......................   Passed    0.23 sec
      Start 26: test-quantize-fns
26/30 Test #26: test-quantize-fns .................   Passed   15.27 sec
      Start 27: test-quantize-perf
27/30 Test #27: test-quantize-perf ................   Passed    0.03 sec
      Start 28: test-rope
28/30 Test #28: test-rope .........................   Passed    0.02 sec
      Start 29: test-json-schema-to-grammar
29/30 Test #29: test-json-schema-to-grammar .......   Passed    1.83 sec
      Start 30: test-eval-callback
30/30 Test #30: test-eval-callback ................   Passed    0.42 sec

100% tests passed, 0 tests failed out of 30

Label Time Summary:
curl             =   0.42 sec*proc (1 test)
eval-callback    =   0.42 sec*proc (1 test)
main             =  49.47 sec*proc (27 tests)
model            =   0.41 sec*proc (2 tests)

Total Test time (real) =  50.32 sec

If anyone has any ideas on what I could do to reproduce the error and try to fix it, I'm interested :)

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 195499d to 806755f on December 11, 2024 at 15:05
slaren commented Dec 11, 2024

I have no idea what's going on with the ARM build flags. Why is it handled differently for Apple than for other ARM platforms? It would be great if all of this could be cleaned up and unified into a single branch for ARM in the build.

slaren commented Dec 11, 2024

CPY(type_src=f32,type_dst=q5_1,ne=[256,2,3,4],permute=[0,2,1,3]): [CPY] NMSE = 0.000001799 > 0.000001000 FAIL

You can ignore this error.

angt commented Dec 11, 2024

I have no idea what's going on with the ARM build flags. Why is it handled differently for Apple than for other ARM platforms? It would be great if all of this could be cleaned up and unified into a single branch for ARM in the build.

I totally agree that the ARM section could be much simpler; here I'm just fixing cmake so it works like the Makefile (which a priori will soon be deprecated) as a quick win.

slaren commented Dec 11, 2024

If I recall correctly, -march=native is not used on ARM because in some cases the feature definitions (e.g. __ARM_FEATURE_MATMUL_INT8) were not being added by the compiler, so they would not be used regardless. I believe this is why there is code to test the different flags.
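
The kind of per-flag test meant here can be sketched with check_cxx_source_compiles (a simplified illustration in the spirit of the existing checks, not a copy of them; the result variable name is made up):

include(CheckCXXSourceCompiles)

# does the candidate flag actually define the feature macro on this compiler?
set(CMAKE_REQUIRED_FLAGS "-mcpu=native")
check_cxx_source_compiles("
    #ifndef __ARM_FEATURE_MATMUL_INT8
    #error matmul_int8 not enabled
    #endif
    int main() { return 0; }
" HAS_MATMUL_INT8)
unset(CMAKE_REQUIRED_FLAGS)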

angt commented Dec 11, 2024

If I recall correctly, -march=native is not used on ARM because in some cases the feature definitions (e.g. __ARM_FEATURE_MATMUL_INT8) were not being added by the compiler, so they would not be used regardless. I believe this is why there is code to test the different flags.

My experience is that -march=native alone is pretty good, but trying to mix it with -mtune breaks everything (a common habit of x86_64 users).

Here I fix GGML_NATIVE for ARM, which I think is useful in many cases. The real issue is that it is enabled by default on ARM, so maybe we can disable it by default for this platform and let people decide whether to use it?

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch 2 times, most recently from 69361bf to 3215005 on December 16, 2024 at 14:57
@angt angt changed the title from "ggml: allow -march=native on generic ARM platforms" to "ggml: GGML_NATIVE uses -mcpu=native on ARM" on Dec 16, 2024
@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 3215005 to 3fdfeb1 on December 16, 2024 at 15:16
angt commented Dec 17, 2024

I have updated the PR to use -mcpu=native alone on ARM with GGML_NATIVE:

Apple M3:

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 

AWS graviton4 (ubuntu):

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native

ggerganov (Owner) left a comment

Great, M1 Pro now also works correctly:

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native

slaren commented Dec 17, 2024

This is good, but the GGML_NATIVE disabled path still needs to be fixed.

angt commented Dec 17, 2024

This is good, but the GGML_NATIVE disabled path still needs to be fixed.

Totally agree, but I need to find some devices to test on before touching this code.

In the meantime, I don't think the current code introduces any serious regression:

  • on classic ARM (aarch64), the old code won't have any impact; I don't think the FP16_FORMAT_I3E test will cause any problems.
  • GGML_SVE is now only used if GGML_NATIVE is disabled, but that makes sense.
  • when cross-compiling, GGML_NATIVE is automatically deactivated and we fall back to the old code as expected.

So I think the risk is fairly low in the current state, and we do fix some issues.

slaren commented Dec 17, 2024

The problem with the GGML_NATIVE disabled path is that it tries to do what GGML_NATIVE already does, so it is fundamentally wrong. The fix is not complicated, just remove the current code and add an option so that the user can choose the value to pass to -march, as outlined earlier.

ggml/CMakeLists.txt: review comment marked outdated and resolved.
angt commented Dec 17, 2024

The problem with the GGML_NATIVE disabled path is that it tries to do what GGML_NATIVE already does, so it is fundamentally wrong. The fix is not complicated, just remove the current code and add an option so that the user can choose the value to pass to -march, as outlined earlier.

It tries to do the same thing, but for a different system than the local one, which is why it is the only place where CMAKE_SYSTEM_PROCESSOR is used rather than CMAKE_HOST_SYSTEM_PROCESSOR. And sadly, -march and -mcpu won't be enough in that case.

Another possibility is to remove the code completely, as ggml is a library, and then add *.cmake files to llama.cpp and whisper.cpp for cross-compilation to Raspberry Pi and Android, adding all the necessary flags.
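
For illustration, a minimal example of the kind of *.cmake toolchain file meant here (the file name, compiler names, and -mcpu value are all hypothetical):

# aarch64-rpi.cmake: toolchain file for a Raspberry Pi cross-build
set(CMAKE_SYSTEM_NAME      Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)   # describes the target, unlike CMAKE_HOST_SYSTEM_PROCESSOR
set(CMAKE_C_COMPILER       aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER     aarch64-linux-gnu-g++)
set(CMAKE_C_FLAGS_INIT     "-mcpu=cortex-a76")
set(CMAKE_CXX_FLAGS_INIT   "-mcpu=cortex-a76")

It would then be selected with cmake -B build -DCMAKE_TOOLCHAIN_FILE=aarch64-rpi.cmake; since cross-compiling automatically disables GGML_NATIVE, all target-specific flags would live in one place.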

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 7955eb1 to 1dae1d8 on December 18, 2024 at 08:14
angt commented Dec 18, 2024

Naively building on graviton4 with and without this PR:

| Model         | Test   |   t/s master |   t/s 57498e73 |   Speedup |
|:--------------|:-------|-------------:|---------------:|----------:|
| llama 1B Q4_0 | pp512  |       184.38 |        1149.10 |      6.23 |
| llama 1B Q4_0 | tg128  |       107.76 |         175.41 |      1.63 |

slaren commented Dec 18, 2024

-mcpu=native does not enable __ARM_FEATURE_MATMUL_INT8 on the M3 Max. @ggerganov you mentioned that it worked correctly on the M1 Pro; did you check whether it also enables all the features on the M2/M4?

ggerganov (Owner) commented:

Hm, you are correct - it does not detect the MMI8 feature on either the M2 Ultra or the M4 Max:

20:32:21 ▶ master ▶ 81⎘ ▶ $ ▶ git-pr 10752
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 13 (delta 8), reused 8 (delta 8), pack-reused 3 (from 1)
Unpacking objects: 100% (13/13), 7.22 KiB | 616.00 KiB/s, done.
From https://github.com/ggerganov/llama.cpp
 * [new ref]             refs/pull/10752/head -> pr/10752
Switched to branch 'pr/10752'
 ggerganov ▶ gg-studio ▶ ~/development/github/llama.cpp ▶
20:32:32 ▶ pr/10752 ▶ 81⎘ ▶ cmake -B build-arm
-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /opt/homebrew/bin/git (found version "2.41.0") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: arm64
-- Including CPU backend
-- Accelerate framework found
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) 
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES) 
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND) 
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:53 (message):
  OpenMP not found
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:298 (ggml_add_cpu_backend_variant_impl)


-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework  
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes: 
-- Including BLAS backend
-- Metal framework found
-- The ASM compiler identification is AppleClang
-- Found assembler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Including METAL backend
-- Configuring done (1.1s)
-- Generating done (0.4s)

slaren commented Dec 18, 2024

I am continuing this in #10890.

@slaren slaren closed this Dec 18, 2024