Threading #15
I agree the new partr framework seems to be the way people are heading also for threading support in the lower libraries and it only makes sense to follow along with it. Especially, since users of our code could do all sorts of things on top. Regarding |
So, I did some very basic experiments. For a system with 400,000 plane waves, FFTW's own threading doesn't seem to do much: setting both FFTW and BLAS threads to the number of cores on my computer gave me a 20% speedup. So we should either do #9, or do our own threading |
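For reference, a minimal sketch of what "setting both FFTW and BLAS threads to the number of cores" could look like. The grid size is a hypothetical placeholder, and using `Sys.CPU_THREADS` as the thread count is an assumption (it counts hardware threads, which may exceed physical cores when hyper-threading is on); the experiment above may have been set up differently.

```julia
using FFTW
using LinearAlgebra   # exports the BLAS submodule

# Assumption: use every hardware thread Julia reports.
ncores = Sys.CPU_THREADS
FFTW.set_num_threads(ncores)   # affects plans created after this call
BLAS.set_num_threads(ncores)

# Hypothetical grid size, chosen only for illustration.
x = rand(ComplexF64, 64, 64, 64)
p = plan_fft(x)   # plan once, reuse for every transform of this shape
y = p * x         # forward FFT using FFTW's internal threading
```

Timing `p * x` with and without the two `set_num_threads` calls is the simplest way to reproduce this kind of comparison on a given machine.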
Hmm, 20% is surprisingly little, but maybe I misunderstand what you did. Could you perhaps commit a small benchmark script? I think it would be good to have a few "benchmark cases" or to integrate with https://github.com/JuliaCI/PkgBenchmark.jl so that one can track the performance better. What do you think? |
That's |
@mfherbst can you try the following benchmarking script on the machine you have? https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f Here I get
So the good news is that all methods of parallelization are essentially the same. The bad news is that they all suck :-) It looks like FFTs are almost memory-bound, and so do not benefit much from parallelization (at least on my machine). That's on julia 1.3. I'd test on the lab's cluster, but I'm getting proxy errors... |
My machine (julia 1.3, fftw)
|
Cluster08 (julia 1.2, MKL)
and (again 1.2, MKL)
|
clustern20 (with julia 1.1, I can't make 1.3 work with the proxy for some reason):
That's much better. I think that's consistent with FFTs being memory limited, but memory scaling differently on different machines. Takeaways: oversubscription is fine, FFTW doesn't do better than outer threading. So my suggestion is to plan for a single (like we do now) threaded FFT (by setting |
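A rough sketch of the "outer threading" alternative discussed here, assuming a set of independent FFTs (for example one per band) that share a single plan. The sizes and variable names are hypothetical and this is not the benchmark gist's code; it only illustrates the pattern of single-threaded FFTs distributed over Julia threads.

```julia
using FFTW
using Base.Threads

# Each individual FFT runs on one thread; parallelism comes from the outer loop.
FFTW.set_num_threads(1)

# Hypothetical sizes, for illustration only.
n, nbands = 32, 8
bands = [rand(ComplexF64, n, n, n) for _ in 1:nbands]
out = Vector{Array{ComplexF64,3}}(undef, nbands)

p = plan_fft(bands[1])   # one plan, created once and reused by all threads

@threads for i in 1:nbands
    out[i] = p * bands[i]   # independent transforms, each writes its own slot
end
```

Julia must be started with several threads (for example via `JULIA_NUM_THREADS`) for the loop to actually run in parallel; with one thread it degrades to a serial loop.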
Be careful with the 32 threads on cluster 20 ... it has hyper threading enabled, so effectively it's only 16 cores |
Yeah I know, that was basically to test oversubscription |
Julia 1.3 has changed the way they update the registries in a way that it seems to ignore the proxy settings ... I've had the same issues. |
For FFTW I think you are right, but for MKL's FFT the picture seems to be different. |
A bit, but maybe the results are too noisy. Can you run the 16 threads test again? I want to see if
should be trusted or not. |
Another run:
and yet one more:
and on another machine (cc09):
|
The difference is similar in each case 50 to 60 ms. |
Hm. So results are inconsistent, but always in the same direction. I'm tempted to ignore... We really should see what it does with 1.3 (or even better, master). There are a few open issues on the julia github about proxies, I posted in one, but proxies are a uniform pain. |
But really, what this all shows is that a single FFT is already pretty well parallelized. Meaning that we can just ignore this and not do any threading at all (ie what we have now), and it'll be within a factor of 2 of the optimal (at least for these sizes). If we just add |
For proxy issues, see julia issue 33111, that fixed it for me |
So 1.3 improves the manual_threaded for me:
Still a slight edge for auto FFTW on 32 cores, but that changes from benchmark to benchmark, and when I repeated it manual_threaded was faster. So let's go with #77 and not bother too much. |
I agree. Especially since this keeps more control on our end and opens the way to integrate with the developments happening in Julia in the future. |
OK, let's close this one for now then. We can revisit according to profiling. |
One thing is that FFTW defaults to no threading. Let's keep that manual for now, but note for later that we have to |
That is not true. For me it does. |
See https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L59. This is activated if |
Oh, you're absolutely right, I stopped at https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L41. They're really confident oversubscription is not a problem then! |
Indeed. I just saw that, too. |
New developments:
https://julialang.org/blog/2019/07/multithreading
JuliaMath/FFTW.jl#105
JuliaLang/julia#32786
There's also the Strided.jl package that automatically parallelizes broadcasts.
Note that there's a significant overhead for now: https://discourse.julialang.org/t/multithreaded-broadcast/26786, which appears to be a known issue that will get better at some point in the future JuliaLang/julia#32701 (comment)
So it looks like the preferred model will be that julia's scheduler handles all the threading, and the underlying libraries use julia's threads. Essentially this means that we will be able to just set JULIA_NUM_THREADS and get threaded FFT/BLAS from there. If we find out that this is too fine-grained to yield good speedup, we can add explicit annotations (e.g. @threads on the loop over bands for the Hamiltonian application, or @strided on selected time-intensive broadcasts), and that should work fine.
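As an illustration of the explicit-annotation fallback, here is a minimal sketch of a threaded loop over bands. The function `apply_diagonal!` and the diagonal "kinetic" operator are hypothetical stand-ins for a real Hamiltonian application, which would also involve FFTs and local potentials; only the loop structure is the point.

```julia
using Base.Threads

# Hypothetical stand-in for applying a Hamiltonian: a diagonal operator
# acting on each band (column of ψ) independently.
function apply_diagonal!(Hψ, kinetic, ψ)
    @threads for n in axes(ψ, 2)                 # one band per iteration
        @views Hψ[:, n] .= kinetic .* ψ[:, n]    # in-place, no temporaries
    end
    return Hψ
end

# Hypothetical sizes, for illustration only.
npw, nbands = 1000, 16
kinetic = rand(npw)
ψ = rand(ComplexF64, npw, nbands)
Hψ = apply_diagonal!(similar(ψ), kinetic, ψ)
```

Because each iteration touches a different column of `Hψ`, the bands can be processed without any locking.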