Constructing a high resolution space on GPU fails #2096

Open
juliasloan25 opened this issue Dec 3, 2024 · 5 comments · May be fixed by #2100

@juliasloan25
Member

Describe the bug

When I try to set up a space with nelements >= 200, I get the following error: ERROR: LoadError: Number of blocks in y-dimension exceeds device limit (240000 > 65535).

I found this online: "Blocks can be organized into one, two or three-dimensional grids of up to 2^31-1, 65,535 and 65,535 blocks in the x, y and z dimensions respectively." It looks like we're hitting the limit in the y dimension when we construct our space, but there should be plenty of space in the x dimension. Maybe the usage can be changed in ClimaCore?

To Reproduce

[on clima node]

srun --gpus=1 --mpi=none -t 02:00:00 --pty bash -l 
export CLIMACOMMS_DEVICE="CUDA"
module load common
julia --project=.buildkite
import ClimaComms
ClimaComms.@import_required_backends
import ClimaCore:
    Domains,
    Fields,
    Geometry,
    Meshes,
    Spaces,
    Topologies

FT = Float64
radius = FT(6378.1e3)
depth = FT(50)
nelements = (200, 15)
dz_tuple = FT.((10.0, 0.05))
npolynomial = 1

device = ClimaComms.device()
comms_ctx = ClimaComms.context()

vertdomain = Domains.IntervalDomain(
    Geometry.ZPoint(FT(-depth)),
    Geometry.ZPoint(FT(0));
    boundary_names = (:bottom, :top),
)
vertmesh = Meshes.IntervalMesh(
    vertdomain,
    Meshes.GeneralizedExponentialStretching{FT}(
        dz_tuple[1],
        dz_tuple[2],
    );
    nelems = nelements[2],
    reverse_mode = true,
)
vert_center_space = Spaces.CenterFiniteDifferenceSpace(device, vertmesh)

horzdomain = Domains.SphereDomain(radius)
horzmesh = Meshes.EquiangularCubedSphere(horzdomain, nelements[1])
horztopology = Topologies.Topology2D(comms_ctx, horzmesh)
quad = Spaces.Quadratures.GLL{npolynomial + 1}()
horzspace = Spaces.SpectralElementSpace2D(horztopology, quad)

# Fails with `ERROR: Number of blocks in y-dimension exceeds device limit (240000 > 65535).`
subsurface_space = Spaces.ExtrudedFiniteDifferenceSpace(
    horzspace,
    vert_center_space,
)

Setup information

Using ClimaCore v0.14.20

[jsloan@clima ClimaCore.jl]$ module list
Currently Loaded Modulefiles:
 1) openmpi/4.1.5-mpitrampoline   2) julia/1.10.0   3) cuda/julia-pref   4) common
@juliasloan25 juliasloan25 added the bug Something isn't working label Dec 3, 2024
@Sbozzolo
Member

Sbozzolo commented Dec 5, 2024

Shorter reproducer:

import ClimaCore
center_space = ClimaCore.CommonSpaces.ExtrudedCubedSphereSpace(
    Float32;
    radius = 1.0,
    h_elem = 105,
    z_elem = 10,
    z_min = 1.0,
    z_max = 2.0,
    n_quad_points = 4,
    staggering = ClimaCore.Grids.CellCenter(),
)

Any h_elem greater than 104 fails.
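
The 104 threshold is consistent with the element count of a cubed sphere, which has 6 panels of h_elem × h_elem elements each. A minimal sketch of that arithmetic, assuming the 65,535 y-dimension block limit quoted in the issue description (the helper name is illustrative and not part of ClimaCore):

```julia
# CUDA limit on the number of blocks in the y (and z) grid dimension.
const MAX_GRID_DIM_Y = 65_535

# A cubed sphere has 6 panels, each with h_elem × h_elem spectral elements.
# `cubed_sphere_nelems` is a hypothetical helper, not a ClimaCore function.
cubed_sphere_nelems(h_elem) = 6 * h_elem^2

# Largest h_elem whose element count still fits in the y dimension:
# 6 * h^2 <= limit  <=>  h <= isqrt(limit ÷ 6)
max_h_elem = isqrt(MAX_GRID_DIM_Y ÷ 6)

for h_elem in (104, 105, 200)
    nh = cubed_sphere_nelems(h_elem)
    println("h_elem = $h_elem: Nh = $nh, fits = $(nh <= MAX_GRID_DIM_Y)")
end
```

The h_elem = 200 case gives the 240,000 blocks reported in the original error message.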

@Sbozzolo
Member

While this is being worked on, we can work around it by setting auto = true and passing nitems in Base.copyto! in data_layouts_copyto!. I haven't tried running a full simulation yet.

                args = (dest, bc, us)
                threads = threads_via_occupancy(knl_copyto!, args)
                n_max_threads = min(threads, get_N(us))
                p = partition(dest, n_max_threads)
                nitems = get_N(us)
                auto_launch!(
                    knl_copyto!,
                    args,
                    nitems;
                    auto = true,
                    threads_s = p.threads,
                    blocks_s = p.blocks,
                )

@sriharshakandala
Member

We have kernel launch patterns that use the grid configuration (Nv, Nh).

h_elem = 105 corresponds to Nh = 6 × 105² = 66,150 spectral elements, which exceeds the 65,535 limit for the second dimension of the CUDA grid. In the vertical direction, however, we rarely use more than 256 vertical levels, which translates to Nv = 16 or lower in most cases (Nv is approximately n_vertical_levels / 16).
We have the following options:

  1. Flip to (Nh, Nv), as Nv is very small for almost all of our use cases. We would hit this limit at 1,048,560 vertical levels with Nq=4, which we do not anticipate using.
  2. Move to one dimensional indexing (N,) and extract block indexes for h and v from the one-dimensional block id.

The first option is the easiest to implement, so it is preferable unless we have a good reason to use the second option.
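
To make the two options concrete, here is a minimal CPU-only sketch of the index arithmetic. It assumes 16 vertical levels per block (Nq = 4) as described above; `block_indices` is a hypothetical helper for illustration, not ClimaCore's actual kernel code:

```julia
const MAX_GRID_DIM_X = 2^31 - 1   # CUDA block limit in x
const MAX_GRID_DIM_Y = 65_535     # CUDA block limit in y and z

# Option 1: with a (Nh, Nv) grid, Nv sits in y. Assuming 16 levels per
# vertical block, the limit is reached at this many vertical levels.
max_vertical_levels = MAX_GRID_DIM_Y * 16

# Option 2: launch Nv * Nh blocks along x only and recover (v, h) from the
# 1-based linear block id, with v varying fastest.
function block_indices(linear_id::Integer, Nv::Integer)
    h, v = divrem(linear_id - 1, Nv)
    return (v + 1, h + 1)
end

Nv, Nh = 16, 66_150               # Nh from h_elem = 105
nblocks = Nv * Nh                 # 1,058,400 blocks, well below the x limit

println(block_indices(1, Nv))        # first block
println(block_indices(Nv + 1, Nv))   # first block of the second element
println(block_indices(nblocks, Nv))  # last block
```

The same mapping is what `CartesianIndices((Nv, Nh))[linear_id]` returns in column-major order, so a kernel could use either form.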

@Sbozzolo
Member

Sbozzolo commented Dec 12, 2024

From a user's point of view, I think we should try to avoid any limit in a foundational package like ClimaCore. We don't know what configurations users are going to set up, so I think there shouldn't be any artificial restriction on the maximum number of levels/elements one can place in either the vertical or the horizontal direction. If we just swap Nh with Nv, this problem will come back when someone tries to run a high-vertical-resolution box or column.

@sriharshakandala
Member

sriharshakandala commented Dec 12, 2024

Sure. With option 1, we can still loop over vertical level blocks once we hit the limit. It's primarily about significantly increasing the limit for Nh.
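
One way to read "loop over vertical level blocks" is a strided loop: launch at most the device limit of blocks in y and let each launched block handle several vertical blocks. The following is a CPU-only sketch of that idea under this assumption; `vertical_blocks_for` is a hypothetical name, and the small `max_blocks` value is only there to keep the demonstration readable:

```julia
const MAX_GRID_DIM_Y = 65_535   # CUDA block limit in y

# Vertical block indices that launched block `block_y` would process when the
# y grid dimension is capped at `max_blocks` (hypothetical helper).
function vertical_blocks_for(block_y, Nv; max_blocks = MAX_GRID_DIM_Y)
    stride = min(Nv, max_blocks)   # number of blocks actually launched in y
    return block_y:stride:Nv
end

# Small demonstration: 10 vertical blocks, but only 4 may be launched.
Nv, max_blocks = 10, 4
launched = min(Nv, max_blocks)
for b in 1:launched
    println("launched block $b handles ", collect(vertical_blocks_for(b, Nv; max_blocks)))
end

# Together, the launched blocks cover every vertical block exactly once.
covered = sort(collect(Iterators.flatten(
    vertical_blocks_for(b, Nv; max_blocks) for b in 1:launched
)))
```

When Nv is below the limit, the stride equals Nv and each launched block handles exactly one vertical block, so the common case is unchanged.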
