Improve compile times #829
Conversation
Nice progress with this issue 👌 In general it is more beneficial for compile times to split declarations from definitions, as that creates more parallelism (see the sketch below). We're doing that here and there, but it's rather up to the individual developer; as with the code formatting, this is an effect of not having had the discussion about coding style. I've seen quite some time spent while linking, and using the LLVM linker didn't improve things too dramatically. From what I gathered from the images above, linking seems to be greatly improved by these changes as well (how did you generate those, btw?).
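A minimal sketch of that declaration/definition split, assuming a hypothetical standalone kernel (the names are illustrative, not from the DAPHNE codebase): only the declaration lives in the header, so the many translation units including it stay cheap to parse and can compile in parallel, while a change to the definition rebuilds a single `.cpp` file.

```cpp
// ---- matmul.h (hypothetical) ----
// Declaration only: TUs that include this header do not re-parse the
// definition, and editing the body below does not force them to recompile.
#pragma once
#include <cstddef>
#include <vector>

std::vector<double> matMul(const std::vector<double> &lhs,
                           const std::vector<double> &rhs,
                           std::size_t m, std::size_t n, std::size_t k);

// ---- matmul.cpp (hypothetical) ----
// The definition is compiled in exactly one translation unit, which the
// build system schedules in parallel with every other .cpp file.
#include "matmul.h" // would be a real include in a separate file

std::vector<double> matMul(const std::vector<double> &lhs,
                           const std::vector<double> &rhs,
                           std::size_t m, std::size_t n, std::size_t k) {
    std::vector<double> out(m * k, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < k; ++j)
            for (std::size_t l = 0; l < n; ++l)
                out[i * k + j] += lhs[i * n + l] * rhs[l * k + j];
    return out;
}
```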
The CUDA kernels don't have as much impact since there are fewer of them, but they can also benefit from this imho. In general I'm positive towards more granularity in compilation and reducing the weight of libAllKernels.

I don't want to hijack the discussion here, but I'll take the opportunity to raise attention to #474, where I never received feedback.
Mhm, from what I've seen linking shouldn't take too long; are you sure LLD is actually being used? The images are generated by parsing the build logs.
I was relying on the configuration in our main CMakeLists to use LLD. I didn't do a study on it; I just didn't see a lot of speedup after it was introduced, and at the end of the compilation process there's a single core at 100% for minutes. I thought LLD could do some magic there, without looking into it further 🙈
👍
There should be an early message output saying which linker is used.

I guess this does not work as intended, as it is printing both messages.
I just tested these changes and can report an improvement from 2m47s → 1m17s 🥇 🎉 🚀
(force-pushed from 4bc8c5c to a629d8d)
Results from running in the container:
That should only have happened if you didn't pull the updates in this PR, as this has been fixed in 6532eaa. Could that be the case? For GPU to work you'd need the up-to-date PR.
Tests are running fine now with and without CUDA. I'd merge it in once the mystery of 20 and 104 is resolved.
Oh, the PR is still in draft state. Are further changes planned for this?

Actually it's no longer in draft; maybe I shouldn't put that in the title :)

Thanks a lot for testing this with CUDA @corepointer!

My bad for jumping to conclusions 🙈
(force-pushed from 298366f to 529f0cb)
(force-pushed from 529f0cb to 344d9ac)
(force-pushed from 344d9ac to 40ca523)
LGTM - ran the test suite once again locally and fine-tuned the commit message a bit while merging ;-) Thx for your efforts @philipportner
This PR aims to improve compile times of DAPHNE itself (`./build.sh`). Improving the compilation times of a fully fresh build, including dependencies, is not the goal of this PR.

To establish a baseline: on my machine a fresh DAPHNE build normally takes about 2min 35s.

Before: ~2min 35s

Here we can see a profiled run of the current state (9a1b93e).
I've made some changes, mainly splitting up the `kernels.cpp` file that is generated by `genKernelInst.py` into one compilation unit per kernel. This improves parallelization, as each translation unit can be built by itself. Yes, this does introduce some overhead from parsing the same headers in multiple translation units, but for now the results are way better than before. I also experimented with splitting into fewer translation units; giving each kernel its own translation unit provided the fastest compile times.

Modifying and recompiling a single header (or a couple of them) now also takes only a few seconds, as only the changed translation units need to be recompiled instead of all the kernels.
Additionally, I've made `libAllKernels` an `OBJECT` library. This enables parallel compilation of the kernels library and the compiler parts.

Current: ~1min 30s

With that, compilation times come down to about 1min 30s. The result can be seen here.
There are still some heavy hitters, the biggest offenders being the files that depend on `MLIRDaphne` and include `daphne.h`, and the `EigenCal.h` kernel, which brings in the `Eigen` dependency. Removing these 4 kernels from compilation would reduce the compilation time down to 1min. IMO we should move these few kernels out of the `AllKernels` target to decouple the kernel runtime from the compiler infrastructure.

Including `daphne.h` introduces a significant amount of overhead in multiple translation units, and it is included all over the codebase. I think decoupling some of the MLIR-related parts from `daphne.h` would improve overall compilation times; that should be done in the future, as this PR concerns itself only with the kernels library (a sketch of the decoupling idea is at the end of this description).

Open issues:
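As for the `daphne.h` decoupling mentioned above, a minimal sketch of the usual forward-declaration technique, assuming hypothetical names (this is not DAPHNE's actual API): a kernel header that only passes a context object through by pointer can forward-declare it instead of including the full MLIR-generated header.

```cpp
// ---- someKernel.h (hypothetical) ----
// A forward declaration replaces `#include "daphne.h"`, so the heavy
// MLIR-generated header is parsed only in the few .cpp files that need
// the full definition, not in every kernel translation unit.
#pragma once
#include <cstddef>

struct DaphneContext; // forward declaration: used by pointer only

template <typename T>
void someKernel(T *res, const T *arg, std::size_t n, DaphneContext *ctx);

// someKernel.cpp would then do the full include:
// #include "daphne.h" // full definition needed only here
```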