
Parallelize across multiple GPUs with MPI4Jax #1071

Open
kianorr opened this issue Jun 25, 2024 · 8 comments · May be fixed by #1495
Labels
enhancement (General label for enhancement. Please also tag with "Speed", "Interface", "Functionality", etc.)
P3 (Highest Priority, someone is/should be actively working on this)
performance (New feature or request to make the code faster)
question (Further information is requested)

Comments

@kianorr
Collaborator

kianorr commented Jun 25, 2024

The general idea is:

  • Physics objectives are put onto separate GPUs
  • Constraints are taken out
  • The Jacobians are then combined on a single GPU to create the $A$ matrix, and an SVD is performed on this $A$ matrix
  • From this a new equilibrium is created, which is then fed back to each GPU (a rough sketch of this flow follows below)
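A rough sketch of what this flow could look like with mpi4jax. This is only an illustration, assuming one objective per MPI rank, equal-sized Jacobian blocks, and the mpi4jax collectives that return a (result, token) pair; my_objective and the SVD step are stand-ins, not DESC's actual interfaces.

import jax
import jax.numpy as jnp
import mpi4jax
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def my_objective(x):
    # stand-in for the physics objective assigned to this rank
    return jnp.sin(x) * (rank + 1)

def local_jacobian(x):
    # each rank computes only its own block of rows of A
    return jax.jacfwd(my_objective)(x)

x = jnp.ones(10)               # optimization variables, replicated on all ranks
J_local = local_jacobian(x)

# gather all blocks; result has shape (comm.size, *J_local.shape)
J_all, _ = mpi4jax.allgather(J_local, comm=comm)

if rank == 0:
    A = J_all.reshape(-1, J_all.shape[-1])   # stack blocks into the full A matrix
    U, s, Vt = jnp.linalg.svd(A, full_matrices=False)
    # ...use the SVD to take a step / build the new equilibrium, then broadcast
    # the updated variables back to every rank (e.g. with mpi4jax.bcast)
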
@dpanici dpanici added low priority Nice to have, but not needed right away performance New feature or request to make the code faster enhancement General label for enhancement. Please also tag with "Speed", "Interface", "Functionality", etc and removed low priority Nice to have, but not needed right away labels Aug 13, 2024
@dpanici
Collaborator

dpanici commented Aug 13, 2024

Add view-only link to hackathon doc

@kianorr
Collaborator Author

kianorr commented Aug 20, 2024

Instructions for installing on della-gpu are under "mpi4jax installation instructions on della-gpu" in this Google Doc: https://docs.google.com/document/d/1x6nGZEiZnAiWBDf20Mcbwbob9a6GMbx3ZpT_huUVZF4/edit?usp=sharing

@dpanici dpanici added the P3 Highest Priority, someone is/should be actively working on this label Nov 11, 2024
@dpanici dpanici added the question Further information is requested label Nov 25, 2024
@dpanici
Collaborator

dpanici commented Nov 25, 2024

@kianorr @f0uriest can you put here what the constraints/limitations of this approach are and what situations it could actually help with?

@YigitElma
Collaborator

Maybe instead of separating objectives, we could distribute the transforms and profiles of the compute method to different GPUs; this would make use of multi-GPU parallelism even for single-objective cases (like ForceBalance).

We can distribute the quantities associated with each grid point and then parallelize over them. The question is then whether this works for flux-surface-averaged quantities. For example, does ForceBalance depend on any quantity that requires a flux surface average in its calculation?

@dpanici @f0uriest @ddudt

@dpanici
Collaborator

dpanici commented Dec 2, 2024

Add a flag for grids to pad out so that grid.num_nodes is evenly divisible by the number of GPUs, with the extra pad nodes assigned a weight of 0 in grid.weights.
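A minimal sketch of that padding logic, using plain arrays rather than DESC's actual Grid class; the pad_nodes_for_devices helper and its argument names are hypothetical and only meant to mirror grid.num_nodes / grid.weights:

import numpy as np

def pad_nodes_for_devices(nodes, weights, n_devices):
    # pad a (num_nodes, 3) node array so num_nodes % n_devices == 0;
    # padded nodes duplicate the last real node but get weight 0, so they
    # contribute nothing to quadrature or objective values
    num_nodes = nodes.shape[0]
    n_pad = (-num_nodes) % n_devices
    if n_pad == 0:
        return nodes, weights
    nodes = np.concatenate([nodes, np.repeat(nodes[-1:], n_pad, axis=0)], axis=0)
    weights = np.concatenate([weights, np.zeros(n_pad)])
    return nodes, weights

# example: 100 nodes split across 3 GPUs -> padded to 102
nodes, weights = pad_nodes_for_devices(np.random.rand(100, 3), np.ones(100), n_devices=3)
assert nodes.shape[0] % 3 == 0 and weights[100:].sum() == 0
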

@ddudt
Collaborator

ddudt commented Dec 2, 2024

More info:

  • Memory usage can already be reduced for many applications with either deriv_mode="blocked", or with jac_chunk_size=1 and deriv_mode="batched" on the outer ObjectiveFunction (see the sketch after this list)
    • Maximal memory reduction comes from deriv_mode="blocked" combined with jac_chunk_size=1 on each sub-Objective in the ObjectiveFunction. If the problem still won't fit with these settings, then multi-GPU parallelization over the objectives won't help any further.
  • If the issue is that the resolution is too high (the Jacobian is too "wide"), then jac_chunk_size=1 is the solution, and resolving this issue would not help further
  • If the issue is that there are too many objectives (the Jacobian is too "tall"), then deriv_mode="blocked" is the solution, and resolving this issue would not help beyond giving a speed improvement
  • If memory is still an issue, then parallelizing the grid nodes across multiple devices could help. This is conceptually similar to deriv_mode="blocked", but would sub-divide each objective (helping with "tall" Jacobians, or very complex objectives)
  • Parallelizing across the objectives and/or the grid nodes would require some plumbing to implement
    • Parallelization across grid nodes would likely require the most work
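For reference, a sketch of how those existing options are passed, assuming the recent DESC keyword names (deriv_mode and jac_chunk_size); check your DESC version's docs for the exact signatures, and note that eq here is just a default Equilibrium for illustration:

from desc.equilibrium import Equilibrium
from desc.objectives import AspectRatio, ForceBalance, ObjectiveFunction

eq = Equilibrium()  # stand-in; use your actual equilibrium

# "batched" Jacobian computed in small chunks: helps when the Jacobian is too "wide"
obj_batched = ObjectiveFunction(ForceBalance(eq), deriv_mode="batched", jac_chunk_size=1)

# "blocked": each sub-objective computes its own Jacobian block, which helps when the
# Jacobian is too "tall"; chunking each sub-objective as well gives maximal memory reduction
obj_blocked = ObjectiveFunction(
    (ForceBalance(eq, jac_chunk_size=1), AspectRatio(eq=eq, jac_chunk_size=1)),
    deriv_mode="blocked",
)
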

@ddudt ddudt changed the title Implement multi-gpu functionality for separate objectives with MPI4Jax Parallelize across multiple GPUs with MPI4Jax Dec 2, 2024
@YigitElma
Collaborator

YigitElma commented Dec 24, 2024

Some initial trials with a dummy function, just to get used to the syntax. I ran this on della with 3 A100s.
multi-gpu-jax.zip

There is a dummy compute function:

import jax
import jax.numpy as jnp

@jax.jit
def compute(params, points):
    # evaluate a cosine series: sum over m of params[m] * cos(m * points)
    modes = jnp.arange(len(params))
    basis = jnp.outer(modes, points)
    res = jnp.dot(params, jnp.cos(basis))
    return res

Some benchmarks with inputs:

from jax.sharding import NamedSharding, PartitionSpec as P

num_sharding = 3
mesh = jax.make_mesh((num_sharding,), ('points',))
sharding = NamedSharding(mesh, P('points'))        # shard along the points axis
replicated_sharding = NamedSharding(mesh, P())     # replicate across all devices

points = jnp.arange(3 * 30000)
params = jax.random.normal(jax.random.key(0), (1000,))

points_sharded = jax.device_put(points, sharding)
params_sharded = jax.device_put(params, replicated_sharding)

# warm up the JIT cache, then time each variant
compute(params_sharded, points_sharded)
%timeit compute(params_sharded, points_sharded).block_until_ready()

compute(params, points)
%timeit compute(params, points).block_until_ready()

compute(params_sharded, points)
%timeit compute(params_sharded, points).block_until_ready()

Timings, in the same order as the %timeit calls above:

643 μs ± 3.61 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.67 ms ± 818 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.82 ms ± 7.73 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Also, to make everything faster (I guess the bottleneck is now the QR factorization), we might need jax-ml/jax#16597.

@YigitElma
Collaborator

> Then the question is, can we use this for flux surface averaged stuff?

Distribute the grid points such that the ones on the same flux surface stay on the same device. Depending on the grid used, we might need to assign a different number of flux surfaces to different devices, to keep the number of grid points the same for each device.
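A rough sketch of one way to do that assignment; assign_surfaces_to_devices is a hypothetical helper that takes the rho coordinate of each node and greedily balances whole flux surfaces across devices:

import numpy as np

def assign_surfaces_to_devices(node_rho, n_devices):
    # group nodes by flux surface (unique rho value) and assign whole surfaces
    # to devices, roughly balancing the number of nodes per device
    surfaces, inverse, counts = np.unique(node_rho, return_inverse=True, return_counts=True)
    device_of_surface = np.zeros(len(surfaces), dtype=int)
    load = np.zeros(n_devices)
    for i in np.argsort(counts)[::-1]:   # place the largest surfaces first
        d = int(np.argmin(load))         # device with the fewest nodes so far
        device_of_surface[i] = d
        load[d] += counts[i]
    return device_of_surface[inverse]    # device id for every grid node

# example: 3 surfaces with 8, 16, and 24 nodes spread over 2 devices
rho = np.repeat([0.2, 0.5, 1.0], [8, 16, 24])
print(assign_surfaces_to_devices(rho, n_devices=2))
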

@YigitElma YigitElma linked a pull request Dec 25, 2024 that will close this issue