
ggml: GGML_NATIVE uses -mcpu=native on ARM #10752

Conversation

angt (Contributor) commented Dec 10, 2024

When building with cmake locally on a generic ARM platform (one not explicitly handled, unlike Apple), GGML_NATIVE has no effect.
Before:

$ rm -rf build && cmake -B build | grep ggml-cpu
-- Adding CPU backend variant ggml-cpu:  

After:

$ rm -rf build && cmake -B build | grep ggml-cpu
-- Adding CPU backend variant ggml-cpu: -march=native

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 10, 2024
if (NOT USER_PROVIDED_MARCH)
    list(APPEND ARCH_FLAGS "-march=native")
endif()
else()
angt (Contributor Author) commented Dec 10, 2024

Putting all the old code in an else might be too drastic, but I guess the other cases are only relevant when cross-compiling.
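
In other words, the structure under discussion looks roughly like this (a simplified sketch, not the exact diff; ARCH_FLAGS is the flag list ggml's CPU backend builds up):

if (GGML_NATIVE)
    # local build: let the compiler target the host CPU directly
    list(APPEND ARCH_FLAGS "-march=native")
else()
    # the pre-existing per-platform flag detection, mainly relevant
    # when cross-compiling (e.g. for Android or Raspberry Pi)
endif()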

Collaborator:

Yes, the other code in fact seems to be doing the same thing that -march=native would do. With GGML_NATIVE disabled, the build should be consistent and depend only on the flags specified at compile time, which is not the case at the moment.

This needs to be completely revamped, and as it is, this PR is just adding to the mess that will need to be cleaned up later.

angt (Contributor Author):

I can help with revamping but I need some clarification first :)

Today, building on generic ARM gives very poor performance because the build system completely ignores the GGML_NATIVE directive, so I just aligned the current code with the current description of GGML_NATIVE in ggml/CMakeLists.txt:

option(GGML_NATIVE "ggml: enable -march=native flag" ${GGML_NATIVE_DEFAULT})

I completely agree that it's not the best way to get performance, but it's better than nothing and it already fixes many modern setups.

So, do we want to relax the definition of GGML_NATIVE and allow using, for example, -mcpu=native on ARM, which would be much better for performance?

The old code was clearly aimed at small devices like Android and Raspberry Pi, and it also used CMAKE_SYSTEM_PROCESSOR; to me it wasn't a way to fix -march=native at all, but rather a way to find acceptable flags when cross-compiling, and in that case you really don't want GGML_NATIVE (hence the move to else).

Maybe @ggerganov has some memories to share about that?

ggerganov (Owner):

Yes, I believe these flags were mostly set by trial and error, back when we were running whisper.cpp on some Raspberry Pis. But this is very likely wrong, as I didn't really understand the specifics, and it should be revamped. I'm not really an expert and I still get quite confused by all the different Arm architectures, so whatever you think makes sense to improve this is welcome. I can test changes on the entire spectrum of Apple Silicon if necessary.

Collaborator:

If I understand correctly, with gcc/clang it is enough to set the correct architecture flags with -march, and -march=native should work in the same way as on x86. The exception is likely to be MSVC once again, because it does not set the preprocessor definitions for the enabled ARM features. In that case, we may consider just dropping support for MSVC on ARM entirely, because it is a constant source of problems, doesn't work with the inline asm kernels, and doesn't really add anything beyond clang or possibly MinGW.

I believe this should work:

  • Set -march=native if GGML_NATIVE is enabled
  • Add a parameter GGML_CPU_ARCH to the build to set the architecture, so that if GGML_NATIVE is disabled and this parameter is provided, -march=${GGML_CPU_ARCH} is used (see the sketch below).
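
A sketch of how that could look (GGML_CPU_ARCH is the name proposed in the bullet above, not an existing ggml option; the wiring is illustrative):

set(GGML_CPU_ARCH "" CACHE STRING "ggml: architecture passed to -march when GGML_NATIVE is off")

if (GGML_NATIVE)
    list(APPEND ARCH_FLAGS "-march=native")
elseif (GGML_CPU_ARCH)
    # reproducible builds: the user names the target explicitly,
    # e.g. cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARCH=armv8.2-a+dotprod+i8mm
    list(APPEND ARCH_FLAGS "-march=${GGML_CPU_ARCH}")
endif()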

angt (Contributor Author):

On ARM, when building for local use (i.e. with GGML_NATIVE), -mcpu=native alone should be the best option as far as I know. -march=native will often miss some opportunities, and -mtune=native only tunes for the current microarchitecture (so the result is still not fully optimized for the CPU).

So I think redefining GGML_NATIVE to something like "try to optimize the build for the current CPU", using -march=native on x86_64 and -mcpu=native on ARM, would already be much simpler and an improvement.
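
In CMake terms, the redefinition would be roughly this (a sketch, assuming the ARM case is detected via CMAKE_SYSTEM_PROCESSOR as elsewhere in the build):

if (GGML_NATIVE)
    if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64|ARM64)$")
        list(APPEND ARCH_FLAGS "-mcpu=native")   # ARM: picks both architecture and tuning for this CPU
    else()
        list(APPEND ARCH_FLAGS "-march=native")  # x86_64 and others
    endif()
endif()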

angt (Contributor Author):

If this sounds good to you, I can adapt this PR in this direction so we can see how it works in practice.

Collaborator:

Yes, sounds good.

angt commented Dec 10, 2024

This CI/CD error is not clear to me:

CPY(type_src=f32,type_dst=q5_1,ne=[256,2,3,4],permute=[0,2,1,3]): [CPY] NMSE = 0.000001799 > 0.000001000 FAIL

My M3 laptop passes all tests:

$ ctest               
Test project /Users/angt/llama.cpp/build
      Start  1: test-tokenizer-0-bert-bge
 1/30 Test  #1: test-tokenizer-0-bert-bge .........   Passed    0.04 sec
      Start  2: test-tokenizer-0-command-r
 2/30 Test  #2: test-tokenizer-0-command-r ........   Passed    0.31 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/30 Test  #3: test-tokenizer-0-deepseek-coder ...   Passed    0.06 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/30 Test  #4: test-tokenizer-0-deepseek-llm .....   Passed    0.12 sec
      Start  5: test-tokenizer-0-falcon
 5/30 Test  #5: test-tokenizer-0-falcon ...........   Passed    0.08 sec
      Start  6: test-tokenizer-0-gpt-2
 6/30 Test  #6: test-tokenizer-0-gpt-2 ............   Passed    0.06 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/30 Test  #7: test-tokenizer-0-llama-bpe ........   Passed    0.24 sec
      Start  8: test-tokenizer-0-llama-spm
 8/30 Test  #8: test-tokenizer-0-llama-spm ........   Passed    0.04 sec
      Start  9: test-tokenizer-0-mpt
 9/30 Test  #9: test-tokenizer-0-mpt ..............   Passed    0.07 sec
      Start 10: test-tokenizer-0-phi-3
10/30 Test #10: test-tokenizer-0-phi-3 ............   Passed    0.03 sec
      Start 11: test-tokenizer-0-qwen2
11/30 Test #11: test-tokenizer-0-qwen2 ............   Passed    0.16 sec
      Start 12: test-tokenizer-0-refact
12/30 Test #12: test-tokenizer-0-refact ...........   Passed    0.07 sec
      Start 13: test-tokenizer-0-starcoder
13/30 Test #13: test-tokenizer-0-starcoder ........   Passed    0.07 sec
      Start 14: test-tokenizer-1-llama-spm
14/30 Test #14: test-tokenizer-1-llama-spm ........   Passed    0.15 sec
      Start 15: test-log
15/30 Test #15: test-log ..........................   Passed    0.01 sec
      Start 16: test-arg-parser
16/30 Test #16: test-arg-parser ...................   Passed    0.02 sec
      Start 17: test-sampling
17/30 Test #17: test-sampling .....................   Passed    0.74 sec
      Start 18: test-chat-template
18/30 Test #18: test-chat-template ................   Passed    0.00 sec
      Start 19: test-grammar-parser
19/30 Test #19: test-grammar-parser ...............   Passed    0.00 sec
      Start 20: test-grammar-integration
20/30 Test #20: test-grammar-integration ..........   Passed    0.01 sec
      Start 21: test-llama-grammar
21/30 Test #21: test-llama-grammar ................   Passed    0.00 sec
      Start 22: test-backend-ops
22/30 Test #22: test-backend-ops ..................   Passed   29.79 sec
      Start 23: test-model-load-cancel
23/30 Test #23: test-model-load-cancel ............   Passed    0.24 sec
      Start 24: test-autorelease
24/30 Test #24: test-autorelease ..................   Passed    0.17 sec
      Start 25: test-barrier
25/30 Test #25: test-barrier ......................   Passed    0.23 sec
      Start 26: test-quantize-fns
26/30 Test #26: test-quantize-fns .................   Passed   15.27 sec
      Start 27: test-quantize-perf
27/30 Test #27: test-quantize-perf ................   Passed    0.03 sec
      Start 28: test-rope
28/30 Test #28: test-rope .........................   Passed    0.02 sec
      Start 29: test-json-schema-to-grammar
29/30 Test #29: test-json-schema-to-grammar .......   Passed    1.83 sec
      Start 30: test-eval-callback
30/30 Test #30: test-eval-callback ................   Passed    0.42 sec

100% tests passed, 0 tests failed out of 30

Label Time Summary:
curl             =   0.42 sec*proc (1 test)
eval-callback    =   0.42 sec*proc (1 test)
main             =  49.47 sec*proc (27 tests)
model            =   0.41 sec*proc (2 tests)

Total Test time (real) =  50.32 sec

If anyone has any ideas on what I could do to reproduce the error and try to fix it, I'm interested :)

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 195499d to 806755f on December 11, 2024 at 15:05
slaren commented Dec 11, 2024

I have no idea what's going on with the ARM build flags. Why is it handled differently for Apple than for other ARM platforms? It would be great if all of this could be cleaned up and unified into a single branch for ARM in the build.

slaren commented Dec 11, 2024

CPY(type_src=f32,type_dst=q5_1,ne=[256,2,3,4],permute=[0,2,1,3]): [CPY] NMSE = 0.000001799 > 0.000001000 FAIL

You can ignore this error.

angt commented Dec 11, 2024

I have no idea what's going on with the ARM build flags. Why is it handled differently for Apple than for other ARM platforms? It would be great if all of this could be cleaned up and unified into a single branch for ARM in the build.

I totally agree that the ARM section could be much simpler; here I'm just fixing cmake so it works like the Makefile (which a priori will soon be deprecated) as a quick win.

slaren commented Dec 11, 2024

If I recall correctly, -march=native is not used on ARM because in some cases the feature definitions (e.g. __ARM_FEATURE_MATMUL_INT8) were not being added by the compiler, so they would not be used regardless. I believe this is why there is code to test the different flags.
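
The kind of per-flag test meant here can be sketched with check_cxx_source_compiles (a simplified illustration in the spirit of the existing checks, not a copy of them; the result variable name is made up):

include(CheckCXXSourceCompiles)

# does the candidate flag actually define the feature macro on this compiler?
set(CMAKE_REQUIRED_FLAGS "-mcpu=native")
check_cxx_source_compiles("
    #ifndef __ARM_FEATURE_MATMUL_INT8
    #error matmul_int8 not enabled
    #endif
    int main() { return 0; }
" HAS_MATMUL_INT8)
unset(CMAKE_REQUIRED_FLAGS)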

angt commented Dec 11, 2024

If I recall correctly, -march=native is not used on ARM because in some cases the feature definitions (e.g. __ARM_FEATURE_MATMUL_INT8) were not being added by the compiler, so they would not be used regardless. I believe this is why there is code to test the different flags.

My experience is that -march=native alone is pretty good, but trying to mix it with -mtune breaks everything (a common habit of x86_64 users).

Here I fix GGML_NATIVE for ARM, which I think is useful in many cases. The real issue is that it is enabled by default on ARM, so maybe we can disable it by default for this platform and let people decide whether to use it?

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch 2 times, most recently from 69361bf to 3215005 on December 16, 2024 at 14:57
@angt angt changed the title from "ggml: allow -march=native on generic ARM platforms" to "ggml: GGML_NATIVE uses -mcpu=native on ARM" on Dec 16, 2024
@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 3215005 to 3fdfeb1 on December 16, 2024 at 15:16
angt commented Dec 17, 2024

I have updated the PR to use -mcpu=native alone on ARM with GGML_NATIVE:

Apple M3:

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 

AWS graviton4 (ubuntu):

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature SVE enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native

ggerganov (Owner) left a comment

Great, M1 Pro now also works correctly:

-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native

slaren commented Dec 17, 2024

This is good, but the GGML_NATIVE disabled path still needs to be fixed.

angt commented Dec 17, 2024

This is good, but the GGML_NATIVE disabled path still needs to be fixed.

Totally agree, but I need to find some devices to test on before touching this code.

In the meantime, I don't think the current code introduces any serious regression:

  • on classic ARM (aarch64), the old code won't have any impact; I don't think the FP16_FORMAT_I3E test will cause any problems.
  • GGML_SVE is now only used if GGML_NATIVE is disabled, but that makes sense.
  • when cross-compiling, GGML_NATIVE is automatically deactivated and we fall back to the old code as expected.

So I think the risk is fairly low in the current state, and we do fix some issues.

slaren commented Dec 17, 2024

The problem with the GGML_NATIVE disabled path is that it tries to do what GGML_NATIVE already does, so it is fundamentally wrong. The fix is not complicated, just remove the current code and add an option so that the user can choose the value to pass to -march, as outlined earlier.

ggml/CMakeLists.txt: review comment marked outdated and resolved.
angt commented Dec 17, 2024

The problem with the GGML_NATIVE disabled path is that it tries to do what GGML_NATIVE already does, so it is fundamentally wrong. The fix is not complicated, just remove the current code and add an option so that the user can choose the value to pass to -march, as outlined earlier.

It tries to do the same thing, but for a different system than the local one, which is why it is the only place where CMAKE_SYSTEM_PROCESSOR is used rather than CMAKE_HOST_SYSTEM_PROCESSOR. And sadly, -march and -mcpu won't be enough in that case.

Another possibility is to remove the code completely, as ggml is a library, and then add *.cmake files to llama.cpp and whisper.cpp for cross-compilation to Raspberry Pi and Android, adding all the necessary flags.
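
For illustration, a minimal example of the kind of *.cmake toolchain file meant here (the file name, compiler names, and -mcpu value are all hypothetical):

# aarch64-rpi.cmake: toolchain file for a Raspberry Pi cross-build
set(CMAKE_SYSTEM_NAME      Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)   # describes the target, unlike CMAKE_HOST_SYSTEM_PROCESSOR
set(CMAKE_C_COMPILER       aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER     aarch64-linux-gnu-g++)
set(CMAKE_C_FLAGS_INIT     "-mcpu=cortex-a76")
set(CMAKE_CXX_FLAGS_INIT   "-mcpu=cortex-a76")

It would then be selected with cmake -B build -DCMAKE_TOOLCHAIN_FILE=aarch64-rpi.cmake; since cross-compiling automatically disables GGML_NATIVE, all target-specific flags would live in one place.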

@angt angt force-pushed the ggml-allow-march-native-on-generic-arm-platforms branch from 7955eb1 to 1dae1d8 on December 18, 2024 at 08:14
angt commented Dec 18, 2024

Naively building on graviton4 with and without this PR:

| Model         | Test   |   t/s master |   t/s 57498e73 |   Speedup |
|:--------------|:-------|-------------:|---------------:|----------:|
| llama 1B Q4_0 | pp512  |       184.38 |        1149.10 |      6.23 |
| llama 1B Q4_0 | tg128  |       107.76 |         175.41 |      1.63 |

slaren commented Dec 18, 2024

-mcpu=native does not enable __ARM_FEATURE_MATMUL_INT8 on the M3 Max. @ggerganov you mentioned that it worked correctly on the M1 Pro; did you check whether it also enables all the features on the M2/M4?

ggerganov (Owner) commented:

Hm, you are correct - it does not detect the MMI8 feature on either the M2 Ultra or the M4 Max:

20:32:21 ▶ master ▶ 81⎘ ▶ $ ▶ git-pr 10752
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 13 (delta 8), reused 8 (delta 8), pack-reused 3 (from 1)
Unpacking objects: 100% (13/13), 7.22 KiB | 616.00 KiB/s, done.
From https://github.com/ggerganov/llama.cpp
 * [new ref]             refs/pull/10752/head -> pr/10752
Switched to branch 'pr/10752'
 ggerganov ▶ gg-studio ▶ ~/development/github/llama.cpp ▶
20:32:32 ▶ pr/10752 ▶ 81⎘ ▶ cmake -B build-arm
-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /opt/homebrew/bin/git (found version "2.41.0") 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: arm64
-- Including CPU backend
-- Accelerate framework found
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES) 
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES) 
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND) 
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:53 (message):
  OpenMP not found
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:298 (ggml_add_cpu_backend_variant_impl)


-- ARM detected
-- ARM feature DOTPROD enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework  
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes: 
-- Including BLAS backend
-- Metal framework found
-- The ASM compiler identification is AppleClang
-- Found assembler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Including METAL backend
-- Configuring done (1.1s)
-- Generating done (0.4s)

slaren commented Dec 18, 2024

I am continuing this in #10890.

@slaren slaren closed this Dec 18, 2024