Can Quest run on Apple M1 ? #301

Open
keithyau opened this issue Sep 27, 2021 · 10 comments

Comments

@keithyau

Wondering whether LLVM/Clang on Apple M1 can be supported.

@TysonRayJones
Member

Hi there,

I don't have an M1 handy to test, but certainly there's nothing special in the QuEST architecture to preclude it.
I would confidently assume that serial QuEST is supported by whatever the M1 compiling chain is.

For multithreading, QuEST supports OpenMP versions 2.0 (in develop; the master branch temporarily requires 3.1) through to OpenMP 5.0 (the latest). It is not yet tested with 5.1, but is expected to be compatible. Mature releases of Clang support OpenMP (e.g. OpenMP 4.5 in Clang 13). If the M1 compiling chain fully supports Clang, then I expect QuEST to compile fine.

But one never knows until they test!

@keithyau
Author

thank you !

@mmoelle1

Hi there,

I tried compiling QuEST on an M1 and it works. However, it needs some modification of the CMakeLists.txt file.

Original (same for C++ compiler):

# TODO standardize
# set C compiler flags based on compiler type
if ("${CMAKE_C_COMPILER_ID}" STREQUAL "Clang")
  # using Clang
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -mavx -Wall"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU")
  # using GCC
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -mavx -Wall"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "Intel")
  # using Intel
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -fprotect-parens -Wall -xAVX -axCORE-AVX2 -diag-disable cpu-dispatch"
  )
elseif ("${CMAKE_C_COMPILER_ID}" STREQUAL "MSVC")
  # using Visual Studio
  string(REGEX REPLACE "/W3" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
  string(REGEX REPLACE "-W3" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS})
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} \
    -w"
  )
endif()

Apple's default compiler reports itself as AppleClang, so, by accident, none of the branches above match and the -mavx flag (which does not work on M1) is never set. However, when you install a true GCC (e.g. using Homebrew), the block above detects a GNU compiler and sets -mavx, which leads to a compiler error. The same problem happens on any non-x86_64 architecture (ARM/ARM64, PPC).

As a quick fix, I'd suggest wrapping the entire if()...endif() block in if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)") ... endif(), which will disable it for any non-x86_64 architecture.

@TysonRayJones
Member

Hi Matthias,
That's really useful to know, thanks very much!
I've been meaning to test whether QuEST can meaningfully utilise auto-vectorisation for a while, so I'll add that to my backlog and update the build afterward (or remove the flag entirely). @rrmeister who has a better understanding of the CMake build may also be interested.
Thanks again!

@ekapit

ekapit commented Feb 15, 2022

I just got a new M1 Max laptop, and am trying out QuEST on it. Naively, it should be extremely fast: this CPU has 10 cores and 200+ GB/s of usable memory bandwidth, higher than most Xeons, and since memory bandwidth is the primary bottleneck, it should be very quick. I was able to get Apple clang to link to OpenMP correctly, so it is multithreaded. However, when trying it out, it ends up being much slower than on Intel chips. I tried setting "march=apple-m1" as a compiler flag to make sure it's compiling native code, but that didn't seem to change anything. I strongly suspect this is a compiler issue, though I'm not sure what to try next.

Has anyone gotten QuEST to perform well on Apple Silicon?

@TysonRayJones
Member

Hi ekapit,

Hmm that's quite puzzling. I've created a very simple MWE below which modifies a complex array much like QuEST's backend CPU code.

Let's first test if your laptop is performing as expected for a serial simulation.
Can you copy the code below into a file (e.g. github_issue.c), and compile it serially using -O3 optimisation, and whatever additional arguments you need to target M1?

On my 13-inch MacBook, I compiled via

clang github_issue.c -O3 -o test

using clang-10. It ran (./test) in 12s.

In what time does your M1 laptop run?

MWE

/* compile as...
 *  serial:
 *      clang github_issue.c -O3 -o test
 *  multithreaded:
 *      clang github_issue.c -O3 -fopenmp -o test
 *
 * run as...
 *     export OMP_NUM_THREADS=1
 *     ./test
 *
 * Memory cost = 16 * 2^numQb (bytes)
 *      20 qubits = 16 MiB
 *      28 qubits = 4 GiB
 *
 * Serial simulation of 28 qubits on my 13-inch MacBook Pro,
 * compiled with clang-1000.10.44.2:
 *      12.133904 (s)
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <sys/time.h>

#ifdef _OPENMP
#include <omp.h>
#endif

#define START_TIMING() \
    struct timeval tval_before, tval_after, tval_result; \
    gettimeofday(&tval_before, NULL);
    
#define STOP_TIMING() \
    gettimeofday(&tval_after, NULL); \
    timersub(&tval_after, &tval_before, &tval_result); \
    printf("%ld.%06ld (s)\n", \
        (long int) tval_result.tv_sec, \
        (long int) tval_result.tv_usec);

typedef long long unsigned int INDEX;

typedef double complex AMP;

void applyGate(AMP* amps, int t, int numQb) {
    
    const double fac = 1/sqrt(2);
    const INDEX iNum = (1ULL << numQb) >> 1;
    
#ifdef _OPENMP
#pragma omp parallel \
    default  (none) \
    shared   (amps,t,numQb, fac,iNum) \
    private  (i,j,j0k,j1k,a1,a2)
#endif
    {
#ifdef _OPENMP
#pragma omp for schedule (static)
#endif
        for (INDEX i=0; i<iNum; i++) {
            
            // |0>|i> -> |j>|0>|k>, |j>|1>|k>
            INDEX j = (i >> t) << t;
            INDEX j0k = (j << 1ULL) ^ (i - j);
            INDEX j1k = j0k ^ (1ULL << t);
                    
            AMP a1 = amps[j0k];
            AMP a2 = amps[j1k];
            amps[j0k] = fac*a1 + fac*a2;
            amps[j1k] = fac*a1 - fac*a2;
        }
    }
}



int main() {
    
    int numQb = 28;
    
    INDEX numAmp = (1ULL<<numQb);
    AMP* amps = malloc(numAmp * sizeof *amps);
    for (INDEX i=0; i<numAmp; i++)
        amps[i] = 1./i + 2.*I/i;
    
    START_TIMING()
    
    for (int t=0; t<numQb; t++)
        applyGate(amps, t, numQb);
        
    STOP_TIMING()
    
    free(amps);
    return 0;
}

@mmoelle1

Hi Tyson,

I tried your code on my Apple M1 (MacBook Pro), not the M1 Max or Pro mentioned above.

Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.3.0

Serial

❯ clang github_issue.c -O3 -o test
8.559273 (s)

OpenMP

❯ clang github_issue.c -O3 -openmp -o test
7.743996 (s) OMP_NUM_THREADS=1
4.227490 (s) OMP_NUM_THREADS=2
4.195969 (s) OMP_NUM_THREADS=4
4.211792 (s) OMP_NUM_THREADS=8

GCC 11.2.0.3 (from Homebrew)

Serial

7.596629 (s)

OpenMP

7.835089 (s) OMP_NUM_THREADS=1
5.348674 (s) OMP_NUM_THREADS=2
5.083343 (s) OMP_NUM_THREADS=4
5.096947 (s) OMP_NUM_THREADS=8

For GCC the line private (i,j,j0k,j1k,a1,a2) needs to be removed.

@TysonRayJones
Member

Thanks very much Matthias! (and oops regarding GCC; I forgot we have to pre-declare our OpenMP variables there like filthy animals).

Those are encouraging times, which to me confirm ekapit's performance issues are indeed related to build parameters, as we discussed above. Or maybe we're comparing to some very impressive Intel chips! :)

@fieldofnodes

Hi, I have an M1 Max MacBook Pro and I just linked #346 to this issue, as I cannot get QuEST to build with make for testing.

@TysonRayJones
Member

Confirming QuEST v4 (due for release mid-September) runs fine on an M1 Mac (which is now my main development machine!), with a naive build. We'll make sure our revised CMake build avoids the above issues.
