
Improve compile times #829

Merged 1 commit into main on Oct 6, 2024
Conversation

@philipportner (Collaborator) commented Sep 23, 2024:

This PR aims to improve compile times of DAPHNE itself (./build.sh). Improving the compilation times of a fully fresh build, including dependencies, is not the goal of this PR.

To establish some baselines, on my machine a fresh DAPHNE build normally takes about 2min 35s.

Before: ~2min 35s

Here we can see a profiled run of the current state (9a1b93e):
[trace image: profiled build of commit 9a1b93e]

I've made some changes, mainly splitting up the kernels.cpp file that is generated by genKernelInst.py into one compilation unit per kernel. This improves parallelization, as each translation unit can be built by itself. Yes, this does introduce some overhead from parsing the same headers in multiple translation units, but for now the results are way better than before. I also tried splitting into fewer translation units, and giving each kernel its own translation unit provided the fastest compile times.

Modifying a single header (or a couple of them) and recompiling now also takes only a few seconds, as only the changed translation units need to be recompiled instead of all the kernels.

Additionally, I've made libAllKernels an OBJECT library. This enables parallel compilation of the kernels library and the compiler parts. A rough CMake sketch of both changes follows below.
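
As a rough illustration (not the actual DAPHNE build code; file and target names here are hypothetical), the CMake side of such a split could look like this:

```cmake
# Sketch only -- not DAPHNE's actual CMakeLists.txt; the genSrc/ directory
# and the DaphneRuntime target are illustrative names.

# genKernelInst.py now emits one .cpp per kernel instead of a single big
# kernels.cpp, so each kernel becomes its own translation unit.
file(GLOB GENERATED_KERNEL_SOURCES ${CMAKE_CURRENT_BINARY_DIR}/genSrc/*.cpp)

# An OBJECT library compiles straight to .o files without an archive/link
# step, so these compile jobs can run in parallel with the compiler parts
# and are only pulled together at the final link.
add_library(AllKernels OBJECT ${GENERATED_KERNEL_SOURCES})

# Consumers link the kernel objects directly.
add_library(DaphneRuntime SHARED $<TARGET_OBJECTS:AllKernels>)
```

The gain comes from ninja being able to schedule many small, independent compile jobs instead of one huge one.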

Current: ~1min 30s

With that, compilation times come down to about 1min 30s. The result can be seen here:
[trace image: profiled build after the changes]

There still are some heavy hitters. The biggest offenders are the files that depend on MLIRDaphne and include daphne.h, and the EigenCal.h kernel, which pulls in the Eigen dependency.

Removing these 4 kernels from compilation would reduce the compilation time down to 1min. IMO we should move these few kernels out of the AllKernels target to decouple the kernel runtime from the compiler infrastructure. A sketch of what that split could look like follows below.
Including daphne.h introduces a significant amount of overhead in multiple translation units, and it is included all over the codebase. I think decoupling some of the MLIR-related parts from daphne.h would improve overall compilation times; that should be done in the future, as this PR concerns itself only with the kernels library.
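
For illustration, a minimal sketch of moving those kernels out (the target name and EigenCal.cpp are hypothetical; only the EigenCal.h kernel is named in this thread):

```cmake
# Sketch only: isolate the few compiler-coupled kernels so the bulk of
# AllKernels no longer depends on MLIRDaphne or Eigen.
add_library(CompilerCoupledKernels OBJECT EigenCal.cpp)
target_link_libraries(CompilerCoupledKernels PRIVATE MLIRDaphne Eigen3::Eigen)
```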

Open issues:

  • fix duplicate symbols when compiling tests
  • split the kernels for CUDA compilation as well, although I have to look into this more

@corepointer (Collaborator) commented:

Nice progress with this issue 👌

In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.

I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

@corepointer (Collaborator) commented:

The CUDA kernels don't have as much impact, as there are fewer of them, but they can also benefit from this imho. In general, I'm in favor of more granularity in compilation and of reducing the weight of libAllKernels.

@corepointer (Collaborator) commented:

I don't want to hijack the discussion here, but I'll take the opportunity to raise attention to #474, where I never received feedback. Running build.sh from scratch not only takes a lot of time and resources for building, but the git checkout of the LLVM monorepo alone is a matter of a few minutes. Using a snapshot (which I'm mostly doing, as the feature is implemented, but I didn't want to change defaults without discussion) speeds this up considerably. 🚤

@philipportner (Collaborator, Author) commented:

> Nice progress with this issue 👌
>
> In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.
>
> I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

Mhm, from what I've seen linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?

The images are generated by parsing the build/.ninja_log file with ninjatracing and viewing the resulting trace. See #516 for more.
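
For reference, a typical ninjatracing invocation looks like this (assuming a ninja build directory named build/; the exact viewer link from the original comment was lost):

```sh
# Convert ninja's build log into Chrome trace-event JSON.
# ninjatracing: https://github.com/nico/ninjatracing
python3 ninjatracing build/.ninja_log > trace.json

# Open trace.json in a trace viewer such as https://ui.perfetto.dev or
# chrome://tracing to see the per-translation-unit timeline.
```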

@philipportner (Collaborator, Author) commented:

lld links Clang 19 (1.56 GiB) and Chromium 124 (1.35 GiB) in ~5 seconds, so I can't imagine it being a huge part in DAPHNE. Also, I'm not sure what other options we have besides mold, which is from the same author as lld.

@corepointer (Collaborator) commented:

> Nice progress with this issue 👌
>
> In general, it is more beneficial for compile times to split declarations from definitions to create more parallelism. We're doing that here and there, but it's rather up to the individual developer. As with code formatting, this is an effect of not having had the discussion about coding style.
>
> I've seen quite some time spent on linking. Using the LLVM linker didn't improve things too dramatically. It seems that this is also greatly improved with these changes, from what I gathered from the images above (how did you generate those, btw?).

I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈

> Mhm, from what I've seen linking shouldn't take too long. Are you sure CMake is using lld? How did you measure that?
>
> The images are generated by parsing the build/.ninja_log file with ninjatracing and viewing the resulting trace. See #516 for more.

👍

@philipportner (Collaborator, Author) commented Sep 24, 2024:

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈

There should be an early output message saying Using lld. Otherwise I'm not really sure; I don't seem to have that problem :')

@corepointer (Collaborator) commented:

> I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it. I just didn't see a lot of speedup after this was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there without looking into it further 🙈
>
> There should be an early output message saying Using lld. Otherwise I'm not really sure; I don't seem to have that problem :')

I guess this does not work, as it prints both -- Using lld and, before that, -- Linker detection: GNU ld, and link times stay slow.
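
For what it's worth, a minimal sketch of how a CMakeLists can opt into lld and surface which linker is actually picked (this is an assumption about the setup, not DAPHNE's actual configuration):

```cmake
# Sketch only -- an assumed setup, not DAPHNE's actual CMake logic.
find_program(LLD_PROGRAM NAMES ld.lld lld)
if(LLD_PROGRAM)
    message(STATUS "Using lld")
    # This flag must reach the link command line of every target; if a
    # "Linker detection: GNU ld" message still shows up, some part of the
    # build is probably not seeing it.
    add_link_options(-fuse-ld=lld)
else()
    message(STATUS "lld not found; using the default linker")
endif()
```

A quick cross-check is to run the link step with ninja -v and look for -fuse-ld=lld on the actual command line.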

@corepointer (Collaborator) commented:

I just tested these changes and can report an improvement from 2m47 -> 1m17 🥇 🎉 🚀

philipportner marked this pull request as ready for review on Sep 30, 2024, 17:50.
@corepointer (Collaborator) commented:

Results from running in the container:
Build reports success but libAllKernels.so is not there. Therefore, running anything fails.

@philipportner (Collaborator, Author) commented:

> Results from running in the container: Build reports success but libAllKernels.so is not there. Therefore, running anything fails.

That should only happen if you didn't pull the updates in this PR, as this has been fixed in 6532eaa. Could that be the case? For GPU to work, you'd need the up-to-date PR.

@corepointer (Collaborator) left a review comment:


Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.

(Three review threads on src/runtime/local/kernels/CMakeLists.txt; outdated, resolved.)
@corepointer (Collaborator) commented:

Oh, the PR is still in draft state. Are further changes planned for this?

philipportner changed the title from "[draft] improve compile times" to "Improve compile times" on Oct 3, 2024.
@philipportner (Collaborator, Author) commented:

> Oh, the PR is still in draft state. Are further changes planned for this?

Actually, it's no longer a draft; maybe I shouldn't have put that in the title :)

@philipportner (Collaborator, Author) commented:

> Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.

Thanks a lot for testing this with CUDA @corepointer !

@corepointer (Collaborator) commented:

> Oh, the PR is still in draft state. Are further changes planned for this?
>
> Actually, it's no longer a draft; maybe I shouldn't have put that in the title :)

My bad for jumping to conclusions 🙈

philipportner added a commit that referenced this pull request Oct 5, 2024:

This patch splits up kernels.cpp and CUDAkernels.cpp into multiple translation units, one per kernel. This improves compilation times, see the PR #829.

closes #516.

The commit message as fine-tuned during the merge (see below):

This patch splits up kernels.cpp and CUDAkernels.cpp into multiple translation units, one per kernel. This improves compilation times. For a discussion on this matter, see issue #516 and the PR #829.

Resolves #516, Closes #829
@corepointer (Collaborator) commented:

LGTM - ran the test suite once again locally and fine-tuned the commit message a bit while merging ;-)

Thx for your efforts @philipportner

corepointer merged commit 40ca523 into main on Oct 6, 2024.
3 checks passed
Linked issue: Split generated kernels.cpp for faster builds