From CUDA only to CUDA+HIP #768

Merged: 32 commits from hip into master on Oct 30, 2024
Conversation

@brucefan1983 (Owner) commented Oct 23, 2024

Enable HIP compilation for running on AMD GPUs.

make # for CUDA
make -f makefile.hip # for HIP
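
For context on how a single code base targets both backends: the PR adds src/utilities/gpu_macro.cuh (see the review threads below), which maps CUDA runtime names onto their HIP equivalents when building with makefile.hip. A minimal sketch of this macro-layer approach follows; the guard name and the exact set of mappings are illustrative, not the file's actual contents.

#ifdef USE_HIP // illustrative guard; the real header may key off a different macro
  #include <hip/hip_runtime.h>
  // Redirect CUDA runtime names to HIP so the same .cu sources compile with hipcc.
  #define cudaError_t hipError_t
  #define cudaMalloc hipMalloc
  #define cudaFree hipFree
  #define cudaMemcpy hipMemcpy
  #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
  #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
  #define cudaDeviceSynchronize hipDeviceSynchronize
  #define cudaGetDeviceProperties hipGetDeviceProperties
  // Note the trailing _t: mapping to plain hipDeviceProp is exactly the
  // compile error reported further down in this thread.
  #define cudaDeviceProp hipDeviceProp_t
#else
  #include <cuda_runtime.h>
#endif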

@brucefan1983 marked this pull request as ready for review, October 23, 2024 21:49
@elindgren (Collaborator) left a comment


This is really cool! What performance hit can we expect for HIP vs CUDA? From issue #404 it seems like the HIP version performs quite a bit worse than expected?

Review threads: src/force/ilp_tmd_sw.cu (outdated, resolved); src/utilities/gpu_macro.cuh (resolved).
@elindgren (Collaborator)

@Dankomaister in #404 you also mentioned that MD was not working and that the "results are incorrect". Which results were incorrect, and have these issues been resolved?

@jesperbygg

Thanks for the effort, getting this working smoothly would be fantastic!

I managed to compile on LUMI (AMD MI250X) after changing hipDeviceProp to hipDeviceProp_t as commented above. After running some quick tests, I have very similar conclusions as @Dankomaister in #404:

  • NEP training runs and produces sensible results, although slower than expected. For my test case, 1 GCD (half a MI250X GPU) runs 4x slower than on a V100 (CSC Puhti cluster).
  • MD segfaults after Started executing the commands in run.in., regardless of potential.

@brucefan1983 (Owner, Author)

> Thanks for the effort, getting this working smoothly would be fantastic!
>
> I managed to compile on LUMI (AMD MI250X) after changing hipDeviceProp to hipDeviceProp_t as commented above. After running some quick tests, I have very similar conclusions as @Dankomaister in #404:
>
>   • NEP training runs and produces sensible results, although slower than expected. For my test case, 1 GCD (half a MI250X GPU) runs 4x slower than on a V100 (CSC Puhti cluster).
>   • MD segfaults after Started executing the commands in run.in., regardless of potential.

The MD segfaults sound like a problem on the CPU side?

@jesperbygg

> The MD segfaults sound like a problem on the CPU side?

After some more investigation:

  • Compiling with -ggdb and running gpumd through the rocgdb debugger, the segfault details are as follows (see the validator sketch at the end of this comment):

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
0x000000000086d822 in is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff4004) at utilities/read_file.cu:27
27        if (s == NULL || *s == '\0') {

  • Compiling with -O0 or -O1 instead of -O3 works! -O2 does not work. With -O1, gpumd runs on one MI250X GCD with no errors, ~4-5x slower than on a V100 (which is actually similar to my NEP training comparison with -O3).
  • The compiler gives a lot of "loop not unrolled" warnings, but they also appear for -O1, which works:
    warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

This is all on the LUMI cluster with HIP version 6.0.32831-204d35d16 and rocm-6.0.3, with module load PrgEnv-amd rocm craype-x86-trento craype-accel-amd-gfx90a.
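
For context, a strtol-style validator consistent with the read_file.cu lines quoted in the backtraces would look roughly like the sketch below (illustrative; the actual GPUMD function may differ in detail). The key observation is that the crash happens on the very first dereference of s, so the pointer handed in (0x1002) is already invalid before any parsing logic runs.

#include <errno.h>
#include <stdlib.h>

// Sketch of a strtol-based integer validator matching the guard lines in the
// backtraces; not necessarily the exact GPUMD code.
static int is_valid_int(const char* s, int* result)
{
  if (s == NULL || *s == '\0') { // crash site: dereferencing s == 0x1002
    return 0;
  }
  char* end = NULL;
  errno = 0;
  const long value = strtol(s, &end, 10);
  if (errno != 0 || end == s || *end != '\0') {
    return 0; // not a clean integer
  }
  *result = (int)value;
  return 1;
}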

@brucefan1983 (Owner, Author)

The "loop not unrolled" warnings should be gone now. For the segfaults, I have tried to use O1 for the code in the relevant folder and still O3 for other folders. Not sure if this help.

We have a plan to rewrite this part using modern C++, so it should be fine in the future.
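
As an illustration only (not the planned implementation), a C++17 take on such a validator could use std::from_chars, which avoids errno and raw pointer handling at the call site:

#include <charconv>
#include <string_view>

// Illustrative C++17 validator: succeeds only if the whole string is an integer.
static bool is_valid_int_modern(std::string_view s, int& result)
{
  const char* last = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), last, result);
  return ec == std::errc() && ptr == last;
}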

@Dankomaister

Nice to see that HIP support is being added! Great job @brucefan1983.
Once Dardel gets back up and running I will do some more testing on this.

The issue I mentioned in #404 (apart from the segmentation fault) was that running MD on multiple GPUs would give incorrect results, i.e. forces not calculated properly. Not sure if this is still an issue? @jesperbygg, did you try running MD with multiple GPUs?

@Dankomaister

OK, I tried to compile the HIP version on Dardel using ROCm 5.7.0 (the latest version available on Dardel).
Training works fine but MD still gives a segmentation fault…

I tried adding -ggdb and running GPUMD through the rocgdb debugger as @jesperbygg did, and the error is still in read_file.cu:

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff81d4) at utilities/read_file.cu:29
29        } else if (*s == '\0') {

Changing -O1 to -O0 for CFLAGS_O1 solved the segmentation fault; however, GPUMD still crashes with this error:

Started executing the commands in run.in.
Input Error:
    File:       main_gpumd/run.cu
    Line:       510
    Error text: seed should be a positive integer.

Adding a seed to the velocity command solved that problem and MD works!
Changing back to -O1 for CFLAGS_O1 also works, so maybe there is a problem with parsing the velocity command?
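
(For reference: judging from the parse_velocity code quoted later in this thread, where param[1] is the temperature and param[3] the seed, the seeded form is presumably something like velocity <T> seed <integer>; the exact command syntax is not shown here.)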

I also tried to run MD on multiple GPUs; however, this still gives a segmentation fault with

Thread 4 "gpumd" received signal SIGSEGV, Segmentation fault.
[Switching to thread 4, lane 0 (AMDGPU Lane 1:6:1:14/0 (13,0,0)[0,0,0])]
find_descriptor () at force/nep3_multigpu.cu:970
970               gn12 += fn12[k] * annmb.c[c_index];

So MD on multiple GPUs still requires some work, it seems.

@brucefan1983 (Owner, Author)

@Dankomaister Thank you for your further tests. The error message seed should be a positive integer is puzzling to me: that check should only run when a seed is given to the velocity command, yet you hit it before adding the seed.

@brucefan1983 (Owner, Author)

I have tested the CUDA version, and the C++ optimization level does not affect the overall speed at all. So I changed the C++ optimization level to -O1 for HIP globally. If needed, it can also be changed to -O0. @Dankomaister, Jesper has mentioned that using -O0 and -O1 globally for HIP worked. Do you want to have a look at this option?

@Dankomaister

OK, I will try using -O0 and -O1 for HIP globally.
However, I think the problem is not with is_valid_int, since it is only supposed to be called in parse_velocity if num_param == 4. But since it somehow gets called when using velocity 1234.0, there must be a problem elsewhere. Perhaps in determining num_param?

@jesperbygg

For me on LUMI, compiling everything with -O1 as in the latest commit works without segmentation faults or errors (I don't specify any seed for velocity). I have now also tried multiple GPUs. It runs, but as @Dankomaister noticed previously, the results seem to be wrong: the potential energy for a test NEP differs by some 10 meV/atom on 2 GCDs compared to 1 GCD, and although I didn't check the forces, I assume they are inconsistent because the energy is not well conserved on 2 GCDs.

@Dankomaister

More testing: compiling everything with -O1 as in the latest commit works both with and without a seed.
Compiling everything with -O2 or -O3 gives a segmentation fault without a seed but works with a seed!
These are the MD run times for the different optimization levels:
-O3: 30.90 s
-O2: 30.44 s
-O1: 36.24 s
-O0: does not compile (error: ran out of registers during register allocation)

I tried to figure out what happens during parsing of the velocity command that causes the error with -O2 and -O3, simply by modifying the parse_velocity function to print the value of num_param:

void Run::parse_velocity(const char** param, int num_param)
{
  printf("\nvel0 num_param: %d\n", num_param);
  fflush(stdout);
  int seed;
  bool use_seed = false;
  printf("\nvel1 num_param: %d\n", num_param);
  fflush(stdout);
  if (!(num_param == 2 || num_param == 4)) {
    PRINT_INPUT_ERROR("velocity should have 1 or 2 parameters.\n");
  }
  printf("\nvel2 num_param: %d\n", num_param);
  fflush(stdout);
  if (!is_valid_real(param[1], &initial_temperature)) {
    PRINT_INPUT_ERROR("initial temperature should be a real number.\n");
  }
  printf("\nvel3 num_param: %d\n", num_param);
  fflush(stdout);
  if (initial_temperature <= 0.0) {
    PRINT_INPUT_ERROR("initial temperature should be a positive number.\n");
  }
  printf("\nvel4 num_param: %d\n", num_param);
  fflush(stdout);
  if (num_param == 4) {
    printf("\nvel5 num_param: %d\n", num_param);
    fflush(stdout);
    use_seed = true;
    if (!is_valid_int(param[3], &seed)) {
      PRINT_INPUT_ERROR("seed should be a positive integer.\n");
    }
  }
  velocity.initialize(
    has_velocity_in_xyz,
    initial_temperature,
    atom.cpu_mass,
    atom.cpu_position_per_atom,
    atom.cpu_velocity_per_atom,
    atom.velocity_per_atom,
    use_seed,
    seed);
  if (!has_velocity_in_xyz) {
    printf("Initialized velocities with input T = %g K.\n", initial_temperature);
  }
}

I get the following output when running GPUMD:

...
Use Nose-Hoover thermostat and Parrinello-Rahman barostat.
Use NPT ensemble for this run.
Thermostat: t_start is 4000.000000, t_stop is 1000.000000, t_period is 100.000000 timesteps
xx : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps
xy will not be changed.
xz will not be changed.
yx will not be changed.
yy : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps
yz will not be changed.
zx will not be changed.
zy will not be changed.
zz : p_start is 0.000000, p_stop is 0.000000, p_period is 1000.000000 timesteps

vel0 num_param: 2

vel1 num_param: 2

vel2 num_param: 2

vel3 num_param: 2

vel4 num_param: 2

vel5 num_param: 4

Thread 1 "gpumd" received signal SIGSEGV, Segmentation fault.
is_valid_int (s=0x1002 <error: Cannot access memory at address 0x1002>,
    result=result@entry=0x7fffffff81d4) at utilities/read_file.cu:29
29        } else if (*s == '\0') {

I cannot understand why, but somehow the value of num_param gets changed to 4?! Very strange, since num_param is passed by value and nothing in the function writes to it...

@Dankomaister

Hmm, OK, changing the if statements in parse_velocity to the following

void Run::parse_velocity(const char** param, int num_param)
{
  int seed;
  bool use_seed = false;
  
  if (!(num_param == 2 || num_param == 4)) {
    PRINT_INPUT_ERROR("velocity should have 1 or 2 parameters.\n");
  } else if (num_param == 4) {
    use_seed = true;
    if (!is_valid_int(param[3], &seed)) {
      PRINT_INPUT_ERROR("seed should be a positive integer.\n");
    }
  }
  
  if (!is_valid_real(param[1], &initial_temperature)) {
    PRINT_INPUT_ERROR("initial temperature should be a real number.\n");
  }
  if (initial_temperature <= 0.0) {
    PRINT_INPUT_ERROR("initial temperature should be a positive number.\n");
  }
  
  velocity.initialize(
    has_velocity_in_xyz,
    initial_temperature,
    atom.cpu_mass,
    atom.cpu_position_per_atom,
    atom.cpu_velocity_per_atom,
    atom.velocity_per_atom,
    use_seed,
    seed);
  if (!has_velocity_in_xyz) {
    printf("Initialized velocities with input T = %g K.\n", initial_temperature);
  }
}

seems to fix the problem... somehow... MD now runs with -O2 and -O3 :)

@brucefan1983 (Owner, Author)

It's so mysterious! How about adding const to the int?

void Run::parse_velocity(const char** param, const int num_param)

@brucefan1983 (Owner, Author) commented Oct 28, 2024

@Dankomaister

Indeed, the seed option was introduced in GPUMD-3.9, and you found that HIP works with GPUMD-3.8. So this does seem related to the parse_velocity function.

I just tried initializing seed to a value. Not sure if this helps.
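
Concretely, that change is just giving seed a defined value in parse_velocity, so that velocity.initialize() never receives an indeterminate int when no seed is supplied (a sketch; the chosen default is illustrative):

int seed = 12345; // previously uninitialized, yet passed to velocity.initialize() even when num_param == 2
bool use_seed = false;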

@brucefan1983 (Owner, Author) commented Oct 28, 2024

@Dankomaister If there is no better way, I will change the code following your fix and refer to this page in the comments. We can investigate multi-GPU later.

@brucefan1983 (Owner, Author)

I have followed your code to change this part, @Dankomaister.
I will merge this PR and study the multi-GPU version of HIP when I have an environment.

@brucefan1983 merged commit aad40ec into master on Oct 30, 2024 (2 checks passed).
@brucefan1983 deleted the hip branch on October 30, 2024 at 21:16.