[TKW] torch Out of Memory during e2e test with 16 cores #297

Open
raikonenfnu opened this issue Nov 26, 2024 · 0 comments

FAILED tests/kernel/wave/wave_attention_test.py::testAttentionF8[mfma_variant0-False-shape1] - torch.OutOfMemoryError: HIP out of memory. Tried to allocate 10.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes...
FAILED tests/kernel/wave/wave_attention_test.py::testAttention[MMAType.F32_32x32x8_F16-False-True-shape1] - torch.OutOfMemoryError: HIP out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes...
FAILED tests/kernel/wave/wave_attention_test.py::testAttention[MMAType.F32_32x32x8_F16-True-False-shape1] - torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes ...
FAILED tests/kernel/wave/wave_attention_test.py::testAttention[MMAType.F32_16x16x16_F16-False-True-shape1] - torch.OutOfMemoryError: HIP out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes...
FAILED tests/kernel/wave/wave_gemm_test.py::testF8Gemm[MMAType.F32_32x32x16_F8-True-shape2] - torch.OutOfMemoryError: HIP out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 byte...

E           torch.OutOfMemoryError: HIP out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 24.00 GiB of which 0 bytes is free. Of the allocated memory 46.38 MiB is allocated by PyTorch, and 7.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The failures above occur when running the e2e tests (i.e., wave_e2e_tests.py, e2e gemm, and e2e attention) with 16 workers.

This does not break when we use 4 workers. It also did not happen with the old compile_and_invoke, since at that point we used transient memory for inputs and IREE quickly discarded them after the kernel call. We can try to resolve this by making only the compile step use multiple cores while the run step uses a single core, since only the compile takes a while.
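
A rough sketch of that split, not the actual TKW test harness: `compile_kernel` and `run_kernel` are hypothetical stand-ins for the compile step and the GPU invocation, and `compile_then_run` is an illustrative name. The idea is just that compilation fans out across a worker pool, while the step that allocates device memory runs one kernel at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_kernel(name):
    # Hypothetical stand-in for the slow compile step (no GPU memory needed).
    return f"compiled::{name}"


def run_kernel(artifact):
    # Hypothetical stand-in for the step that allocates device tensors
    # and invokes the kernel.
    return f"ran::{artifact}"


def compile_then_run(names, compile_workers=16):
    # Phase 1: compile concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=compile_workers) as pool:
        artifacts = list(pool.map(compile_kernel, names))
    # Phase 2: run serially so only one kernel's inputs are live at a time.
    return [run_kernel(artifact) for artifact in artifacts]


results = compile_then_run(["gemm", "attention"], compile_workers=4)
print(results)
```

A thread pool is used here only to keep the sketch short; if the real compile step is CPU-bound Python, a process pool would likely be the better fit. How this maps onto the pytest workers is left open.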

CC: @Hardcode84

@raikonenfnu raikonenfnu changed the title [TKW] torch Out of Memory [TKW] torch Out of Memory during e2e test with 16 cores Nov 26, 2024