From CUDA only to CUDA+HIP #768

Merged: 32 commits from hip into master on Oct 30, 2024
Conversation

@brucefan1983 (Owner) commented Oct 23, 2024

Enable HIP compilation for running on AMD GPUs.

make # for CUDA
make -f makefile.hip # for HIP
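
For context on how a single code base targets both backends: the PR adds src/utilities/gpu_macro.cuh (see the review threads below), which maps CUDA runtime names onto their HIP equivalents when building with makefile.hip. A minimal sketch of this macro-layer approach follows; the guard name and the exact set of mappings are illustrative, not the file's actual contents.

#ifdef USE_HIP // illustrative guard; the real header may key off a different macro
  #include <hip/hip_runtime.h>
  // Redirect CUDA runtime names to HIP so the same .cu sources compile with hipcc.
  #define cudaError_t hipError_t
  #define cudaMalloc hipMalloc
  #define cudaFree hipFree
  #define cudaMemcpy hipMemcpy
  #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
  #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
  #define cudaDeviceSynchronize hipDeviceSynchronize
  #define cudaGetDeviceProperties hipGetDeviceProperties
  // Note the trailing _t: mapping to plain hipDeviceProp is exactly the
  // compile error reported further down in this thread.
  #define cudaDeviceProp hipDeviceProp_t
#else
  #include <cuda_runtime.h>
#endif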

@brucefan1983 marked this pull request as ready for review, October 23, 2024 21:49
@elindgren (Collaborator) left a comment


This is really cool! What performance hit can we expect for HIP vs CUDA? From issue #404 it seems like the HIP version performs quite a bit worse than expected?

Review threads: src/force/ilp_tmd_sw.cu (outdated, resolved); src/utilities/gpu_macro.cuh (resolved).
@elindgren (Collaborator)

@Dankomaister in #404 you also mentioned that MD was not working and that the "results are incorrect". Which results were incorrect, and have these issues been resolved?

@jesperbygg

Thanks for the effort, getting this working smoothly would be fantastic!

I managed to compile on LUMI (AMD MI250X) after changing hipDeviceProp to hipDeviceProp_t as commented above. After running some quick tests, I have very similar conclusions as @Dankomaister in #404:

  • NEP training runs and produces sensible results, although slower than expected. For my test case, 1 GCD (half a MI250X GPU) runs 4x slower than on a V100 (CSC Puhti cluster).
  • MD segfaults after Started executing the commands in run.in., regardless of potential.

@brucefan1983 (Owner, Author)

> Thanks for the effort, getting this working smoothly would be fantastic!
>
> I managed to compile on LUMI (AMD MI250X) after changing hipDeviceProp to hipDeviceProp_t as commented above. After running some quick tests, I have very similar conclusions as @Dankomaister in #404:
>
>   • NEP training runs and produces sensible results, although slower than expected. For my test case, 1 GCD (half a MI250X GPU) runs 4x slower than on a V100 (CSC Puhti cluster).
>   • MD segfaults after Started executing the commands in run.in., regardless of potential.

The MD segfaults sound like a problem on the CPU side?

@jesperbygg

> The MD segfaults sound like a problem on the CPU side?

After some more investigation:

  • Compiling with -ggdb and running gpumd through the rocgdb debugger, the segfault details are as follows (see the validator sketch at the end of this comment):

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
0x000000000086d822 in is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff4004) at utilities/read_file.cu:27
27        if (s == NULL || *s == '\0') {

  • Compiling with -O0 or -O1 instead of -O3 works! -O2 does not work. With -O1, gpumd runs on one MI250X GCD with no errors, ~4-5x slower than on a V100 (which is actually similar to my NEP training comparison with -O3).
  • The compiler gives a lot of "loop not unrolled" warnings, but they also appear for -O1, which works:
    warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

This is all on the LUMI cluster with HIP version 6.0.32831-204d35d16 and rocm-6.0.3, with module load PrgEnv-amd rocm craype-x86-trento craype-accel-amd-gfx90a.
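
For context, a strtol-style validator consistent with the read_file.cu lines quoted in the backtraces would look roughly like the sketch below (illustrative; the actual GPUMD function may differ in detail). The key observation is that the crash happens on the very first dereference of s, so the pointer handed in (0x1002) is already invalid before any parsing logic runs.

#include <errno.h>
#include <stdlib.h>

// Sketch of a strtol-based integer validator matching the guard lines in the
// backtraces; not necessarily the exact GPUMD code.
static int is_valid_int(const char* s, int* result)
{
  if (s == NULL || *s == '\0') { // crash site: dereferencing s == 0x1002
    return 0;
  }
  char* end = NULL;
  errno = 0;
  const long value = strtol(s, &end, 10);
  if (errno != 0 || end == s || *end != '\0') {
    return 0; // not a clean integer
  }
  *result = (int)value;
  return 1;
}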

@brucefan1983 (Owner, Author)

The "loop not unrolled" warnings should be gone now. For the segfaults, I have tried to use O1 for the code in the relevant folder and still O3 for other folders. Not sure if this help.

We have a plan to rewrite this part using modern C++, so it should be fine in the future.
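
As an illustration only (not the planned implementation), a C++17 take on such a validator could use std::from_chars, which avoids errno and raw pointer handling at the call site:

#include <charconv>
#include <string_view>

// Illustrative C++17 validator: succeeds only if the whole string is an integer.
static bool is_valid_int_modern(std::string_view s, int& result)
{
  const char* last = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), last, result);
  return ec == std::errc() && ptr == last;
}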

@Dankomaister

Nice to see that HIP support is being added! Great job @brucefan1983.
Once Dardel gets back up and running I will do some more testing on this.

The issue I mentioned in #404 (apart from the segmentation fault) was that running MD on multiple GPUs would give incorrect results, i.e. forces not calculated properly. Not sure if this is still an issue? @jesperbygg, did you try running MD with multiple GPUs?

@Dankomaister

OK, I tried to compile the HIP version on Dardel using ROCm 5.7.0 (the latest version available on Dardel).
Training works fine but MD still gives a segmentation fault…

I tried adding -ggdb and running GPUMD through the rocgdb debugger as @jesperbygg did, and the error is still in read_file.cu:

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff81d4) at utilities/read_file.cu:29
29        } else if (*s == '\0') {

Changing -O1 to -O0 for CFLAGS_O1 solved the segmentation fault; however, GPUMD still crashes with this error:

Started executing the commands in run.in.
Input Error:
    File:       main_gpumd/run.cu
    Line:       510
    Error text: seed should be a positive integer.

Adding a seed to the velocity command solved that problem and MD works!
Changing back to -O1 for CFLAGS_O1 also works, so maybe there is a problem with parsing the velocity command?
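
(For reference: judging from the parse_velocity code quoted later in this thread, where param[1] is the temperature and param[3] the seed, the seeded form is presumably something like velocity <T> seed <integer>; the exact command syntax is not shown here.)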

I also tried to run MD on multiple GPUs; however, this still gives a segmentation fault with

Thread 4 "gpumd" received signal SIGSEGV, Segmentation fault.
[Switching to thread 4, lane 0 (AMDGPU Lane 1:6:1:14/0 (13,0,0)[0,0,0])]
find_descriptor () at force/nep3_multigpu.cu:970
970               gn12 += fn12[k] * annmb.c[c_index];

So MD on multiple GPUs still requires some work, it seems.

@brucefan1983 (Owner, Author)

@Dankomaister Thank you for your further tests. The error message seed should be a positive integer is puzzling to me: that check should only run when a seed is given to the velocity command, yet you hit it before adding the seed.

@brucefan1983 (Owner, Author)

I have tested the CUDA version, and the C++ optimization level does not affect the overall speed at all. So I changed the C++ optimization level to -O1 for HIP globally. If needed, it can also be changed to -O0. @Dankomaister, Jesper has mentioned that using -O0 and -O1 globally for HIP worked. Do you want to have a look at this option?

@Dankomaister

OK, I will try using -O0 and -O1 for HIP globally.
However, I think the problem is not with is_valid_int, since it is only supposed to be called in parse_velocity if num_param == 4. But since it somehow gets called when using velocity 1234.0, there must be a problem elsewhere. Perhaps in determining num_param?

@jesperbygg

For me on LUMI, compiling everything with -O1 as in the latest commit works without segmentation faults or errors (I don't specify any seed for velocity). I have now also tried multiple GPUs. It runs, but as @Dankomaister noticed previously, the results seem to be wrong: the potential energy for a test NEP differs by some 10 meV/atom on 2 GCDs compared to 1 GCD, and although I didn't check the forces, I assume they are inconsistent because the energy is not well conserved on 2 GCDs.

@Dankomaister

More testing: compiling everything with -O1 as in the latest commit works both with and without a seed.
Compiling everything with -O2 or -O3 gives a segmentation fault without a seed but works with a seed!
These are the MD run times for the different optimization levels:
-O3: 30.90 s
-O2: 30.44 s
-O1: 36.24 s
-O0: does not compile (error: ran out of registers during register allocation)

I tried to figure out what happens during parsing of the velocity command that causes the error with -O2 and -O3, simply by modifying the parse_velocity function to print the value of num_param:

void Run::parse_velocity(const char** param, int num_param)
{
  printf("\nvel0 num_param: %d\n", num_param);
  fflush(stdout);
  int seed;
  bool use_seed = false;
  printf("\nvel1 num_param: %d\n", num_param);
  fflush(stdout);
  if (!(num_param == 2 || num_param == 4)) {
    PRINT_INPUT_ERROR("velocity should have 1 or 2 parameters.\n");
  }
  printf("\nvel2 num_param: %d\n", num_param);
  fflush(stdout);
  if (!is_valid_real(param[1], &initial_temperature)) {
    PRINT_INPUT_ERROR("initial temperature should be a real number.\n");
  }
  printf("\nvel3 num_param: %d\n", num_param);
  fflush(stdout);
  if (initial_temperature <= 0.0) {
    PRINT_INPUT_ERROR("initial temperature should be a positive number.\n");
  }
  printf("\nvel4 num_param: %d\n", num_param);
  fflush(stdout);
  if (num_param == 4) {
    printf("\nvel5 num_param: %d\n", num_param);
    fflush(stdout);
    use_seed = true;
    if (!is_valid_int(param[3], &seed)) {
      PRINT_INPUT_ERROR("seed should be a positive integer.\n");
    }
  }
  velocity.initialize(
    has_velocity_in_xyz,
    initial_temperature,
    atom.cpu_mass,
    atom.cpu_position_per_atom,
    atom.cpu_velocity_per_atom,
    atom.velocity_per_atom,
    use_seed,
    seed);
  if (!has_velocity_in_xyz) {
    printf("Initialized velocities with input T = %g K.\n", initial_temperature);
  }
}

I get the following output when running GPUMD:

...
Use Nose-Hoover thermostat and Parrinello-Rahman barostat.
Use NPT ensemble for this run.
Thermostat: t_start is 4000.000000, t_stop is 1000.000000, t_period is 100.000000 timesteps
xx : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps
xy will not be changed.
xz will not be changed.
yx will not be changed.
yy : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps
yz will not be changed.
zx will not be changed.
zy will not be changed.
zz : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps

vel0 num_param: 2

vel1 num_param: 2

vel2 num_param: 2

vel3 num_param: 2

vel4 num_param: 2

vel5 num_param: 4

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff81d4) at utilities/read_file.cu:29
29        } else if (*s == '\0') {

I cannot understand why, but somehow the value of num_param gets changed to 4?! Very strange, since num_param is passed by value and nothing in the function writes to it...

@Dankomaister

Hmm, OK, changing the if statements in parse_velocity to the following

void Run::parse_velocity(const char** param, int num_param)
{
  int seed;
  bool use_seed = false;
  
  if (!(num_param == 2 || num_param == 4)) {
    PRINT_INPUT_ERROR("velocity should have 1 or 2 parameters.\n");
  } else if (num_param == 4) {
    use_seed = true;
    if (!is_valid_int(param[3], &seed)) {
      PRINT_INPUT_ERROR("seed should be a positive integer.\n");
    }
  }
  
  if (!is_valid_real(param[1], &initial_temperature)) {
    PRINT_INPUT_ERROR("initial temperature should be a real number.\n");
  }
  if (initial_temperature <= 0.0) {
    PRINT_INPUT_ERROR("initial temperature should be a positive number.\n");
  }
  
  velocity.initialize(
    has_velocity_in_xyz,
    initial_temperature,
    atom.cpu_mass,
    atom.cpu_position_per_atom,
    atom.cpu_velocity_per_atom,
    atom.velocity_per_atom,
    use_seed,
    seed);
  if (!has_velocity_in_xyz) {
    printf("Initialized velocities with input T = %g K.\n", initial_temperature);
  }
}

seems to fix the problem... somehow... MD now runs with -O2 and -O3 :)

@brucefan1983 (Owner, Author)

It's so mysterious! How about adding const to the int?

void Run::parse_velocity(const char** param, const int num_param)

@brucefan1983 (Owner, Author) commented Oct 28, 2024

@Dankomaister

Indeed, the seed option was introduced in GPUMD-3.9, and you found that HIP works with GPUMD-3.8. So this does seem related to the parse_velocity function.

I just tried initializing seed to a value. Not sure if this helps.
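
Concretely, that change is just giving seed a defined value in parse_velocity, so that velocity.initialize() never receives an indeterminate int when no seed is supplied (a sketch; the chosen default is illustrative):

int seed = 12345; // previously uninitialized, yet passed to velocity.initialize() even when num_param == 2
bool use_seed = false;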

@brucefan1983 (Owner, Author) commented Oct 28, 2024

@Dankomaister If there is no better way, I will change the code following your fix and refer to this page in the comments. We can investigate multi-GPU later.

@brucefan1983 (Owner, Author)

I have followed your code to change this part, @Dankomaister.
I will merge this PR and study the multi-GPU version of HIP when I have an environment.

@brucefan1983 merged commit aad40ec into master on Oct 30, 2024 (2 checks passed).
@brucefan1983 deleted the hip branch on October 30, 2024 at 21:16.