
Improve compile times #829

Merged 1 commit into main on Oct 6, 2024
Conversation

@philipportner (Collaborator) commented Sep 23, 2024:

This PR aims to improve compile times of DAPHNE itself (./build.sh). Improving the compilation times of a fully fresh build, including dependencies, is not the goal of this PR.

To establish some baselines, on my machine a fresh DAPHNE build normally takes about 2min 35s.

Before: ~2min 35s

Here we can see a profiled run of the current state (9a1b93e):
[trace image: profiled build of commit 9a1b93e]

I've made some changes, mainly splitting up the kernels.cpp file that is generated by genKernelInst.py into one compilation unit per kernel. This improves parallelization, as each translation unit can be built by itself. Yes, this does introduce some overhead from parsing the same headers in multiple translation units, but for now the results are way better than before. I also tried splitting into fewer translation units, and giving each kernel its own translation unit provided the fastest compile times.

Modifying a single header (or a couple of them) and recompiling now also takes only a few seconds, as only the changed translation units need to be recompiled instead of all the kernels.

Additionally, I've made libAllKernels an OBJECT library. This enables parallel compilation of the kernels library and the compiler parts. A rough CMake sketch of both changes follows below.
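
As a rough illustration (not the actual DAPHNE build code; file and target names here are hypothetical), the CMake side of such a split could look like this:

```cmake
# Sketch only -- not DAPHNE's actual CMakeLists.txt; the genSrc/ directory
# and the DaphneRuntime target are illustrative names.

# genKernelInst.py now emits one .cpp per kernel instead of a single big
# kernels.cpp, so each kernel becomes its own translation unit.
file(GLOB GENERATED_KERNEL_SOURCES ${CMAKE_CURRENT_BINARY_DIR}/genSrc/*.cpp)

# An OBJECT library compiles straight to .o files without an archive/link
# step, so these compile jobs can run in parallel with the compiler parts
# and are only pulled together at the final link.
add_library(AllKernels OBJECT ${GENERATED_KERNEL_SOURCES})

# Consumers link the kernel objects directly.
add_library(DaphneRuntime SHARED $<TARGET_OBJECTS:AllKernels>)
```

The gain comes from ninja being able to schedule many small, independent compile jobs instead of one huge one.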

Current: ~1min 30s

With that, compilation times come down to about 1min 30s. The result can be seen here:
[trace image: profiled build after the changes]

There still are some heavy hitters. The biggest offenders are the files that depend on MLIRDaphne and include daphne.h, and the EigenCal.h kernel, which pulls in the Eigen dependency.

Removing these 4 kernels from compilation would reduce the compilation time down to 1min. IMO we should move these few kernels out of the AllKernels target to decouple the kernel runtime from the compiler infrastructure. A sketch of what that split could look like follows below.
Including daphne.h introduces a significant amount of overhead in multiple translation units, and it is included all over the codebase. I think decoupling some of the MLIR-related parts from daphne.h would improve overall compilation times; that should be done in the future, as this PR concerns itself only with the kernels library.
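
For illustration, a minimal sketch of moving those kernels out (the target name and EigenCal.cpp are hypothetical; only the EigenCal.h kernel is named in this thread):

```cmake
# Sketch only: isolate the few compiler-coupled kernels so the bulk of
# AllKernels no longer depends on MLIRDaphne or Eigen.
add_library(CompilerCoupledKernels OBJECT EigenCal.cpp)
target_link_libraries(CompilerCoupledKernels PRIVATE MLIRDaphne Eigen3::Eigen)
```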

Open issues:

  • fix duplicate symbols when compiling tests
  • split the kernels for CUDA compilation as well, although I have to look into this more

@corepointer (Collaborator) commented:

Nice progress with this issue 👌

In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.

I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

@corepointer (Collaborator) commented:

The CUDA kernels don't have as much impact, as there are fewer of them, but they can also benefit from this imho. In general, I'm in favor of more granularity in compilation and of reducing the weight of libAllKernels.

@corepointer (Collaborator) commented:

I don't want to hijack the discussion here, but I'll take the opportunity to raise attention to #474, where I never received feedback. Running build.sh from scratch not only takes a lot of time and resources for building, but the git checkout of the LLVM monorepo alone is a matter of a few minutes. Using a snapshot (which I'm mostly doing, as the feature is implemented, but I didn't want to change defaults without discussion) speeds this up considerably. 🚤

@philipportner (Collaborator, Author) commented:

> Nice progress with this issue 👌
>
> In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.
>
> I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

Mhm, from what I've seen linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?

The images are generated by parsing the build/.ninja_log file with ninjatracing and viewing the resulting trace. See #516 for more.
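
For reference, a typical ninjatracing invocation looks like this (assuming a ninja build directory named build/; the exact viewer link from the original comment was lost):

```sh
# Convert ninja's build log into Chrome trace-event JSON.
# ninjatracing: https://github.com/nico/ninjatracing
python3 ninjatracing build/.ninja_log > trace.json

# Open trace.json in a trace viewer such as https://ui.perfetto.dev or
# chrome://tracing to see the per-translation-unit timeline.
```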

@philipportner (Collaborator, Author) commented:

lld links Clang 19 (1.56 GiB) and Chromium 124 (1.35 GiB) in ~5 seconds, so I can't imagine it being a huge part in DAPHNE. Also, I'm not sure what other options we have besides mold, which is from the same author as lld.

@corepointer (Collaborator) commented:

> Nice progress with this issue 👌
>
> In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.
>
> I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈

> Mhm, from what I've seen linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?
>
> The images are generated by parsing the build/.ninja_log file with ninjatracing and viewing the resulting trace. See #516 for more.

👍

@philipportner (Collaborator, Author) commented Sep 24, 2024:

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈

There should be an early output message saying Using lld. Otherwise I'm not really sure; I don't seem to have that problem :')

@corepointer (Collaborator) commented:

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈
>
> There should be an early output message saying Using lld. Otherwise I'm not really sure; I don't seem to have that problem :')

I guess this does not work, as it prints both -- Using lld and, before that, -- Linker detection: GNU ld, and link times stay slow.
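
For what it's worth, a minimal sketch of how a CMakeLists can opt into lld and surface which linker is actually picked (this is an assumption about the setup, not DAPHNE's actual configuration):

```cmake
# Sketch only -- an assumed setup, not DAPHNE's actual CMake logic.
find_program(LLD_PROGRAM NAMES ld.lld lld)
if(LLD_PROGRAM)
    message(STATUS "Using lld")
    # This flag must reach the link command line of every target; if a
    # "Linker detection: GNU ld" message still shows up, some part of the
    # build is probably not seeing it.
    add_link_options(-fuse-ld=lld)
else()
    message(STATUS "lld not found; using the default linker")
endif()
```

A quick cross-check is to run the link step with ninja -v and look for -fuse-ld=lld on the actual command line.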

@corepointer (Collaborator) commented:

I just tested these changes and can report an improvement from 2m47 -> 1m17 🥇 🎉 🚀

philipportner marked this pull request as ready for review on Sep 30, 2024, 17:50.
@corepointer (Collaborator) commented:

Results from running in the container:
Build reports success but libAllKernels.so is not there. Therefore, running anything fails.

@philipportner (Collaborator, Author) commented:

> Results from running in the container: Build reports success but libAllKernels.so is not there. Therefore, running anything fails.

That should only happen if you didn't pull the updates in this PR, as this has been fixed in 6532eaa. Could that be the case? For GPU to work, you'd need the up-to-date PR.

@corepointer (Collaborator) left a review comment:


Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.

(Three review threads on src/runtime/local/kernels/CMakeLists.txt; outdated, resolved.)
@corepointer (Collaborator) commented:

Oh, the PR is still in draft state. Are further changes planned for this?

philipportner changed the title from "[draft] improve compile times" to "Improve compile times" on Oct 3, 2024.
@philipportner (Collaborator, Author) commented:

> Oh, the PR is still in draft state. Are further changes planned for this?

Actually, it's no longer a draft; maybe I shouldn't have put that in the title :)

@philipportner (Collaborator, Author) commented:

> Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.

Thanks a lot for testing this with CUDA @corepointer !

@corepointer (Collaborator) commented:

> Oh, the PR is still in draft state. Are further changes planned for this?
>
> Actually, it's no longer a draft; maybe I shouldn't have put that in the title :)

My bad for jumping to conclusions 🙈

philipportner added a commit that referenced this pull request Oct 5, 2024:

This patch splits up kernels.cpp and CUDAkernels.cpp into multiple translation units, one per kernel. This improves compilation times, see the PR #829.

closes #516.

The commit message as fine-tuned during the merge (see below):

This patch splits up kernels.cpp and CUDAkernels.cpp into multiple translation units, one per kernel. This improves compilation times. For a discussion on this matter, see issue #516 and the PR #829.

Resolves #516, Closes #829
@corepointer (Collaborator) commented:

LGTM - ran the test suite once again locally and fine-tuned the commit message a bit while merging ;-)

Thx for your efforts @philipportner

corepointer merged commit 40ca523 into main on Oct 6, 2024.
3 checks passed
Linked issue: Split generated kernels.cpp for faster builds