Threading #15

Closed
antoine-levitt opened this issue Aug 5, 2019 · 26 comments
Labels
performance Performance regression or performance-related

Comments

@antoine-levitt
Member

New developments:
https://julialang.org/blog/2019/07/multithreading
JuliaMath/FFTW.jl#105
JuliaLang/julia#32786
There's also the StridedArrays package that automatically parallelizes broadcasts.
Note that there's significant overhead for now (https://discourse.julialang.org/t/multithreaded-broadcast/26786). This appears to be a known issue that should improve in the future: JuliaLang/julia#32701 (comment)

So it looks like the preferred model will be that Julia's scheduler handles all the threading, and the underlying libraries use Julia's threads. Essentially, this means that we will be able to just set JULIA_NUM_THREADS and get threaded FFT/BLAS from there. If this turns out to be too fine-grained to yield a good speedup, we can add explicit annotations (e.g. @threads on the loop over bands for the Hamiltonian application, or @strided on selected time-intensive broadcasts), and that should work fine.
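
As a minimal sketch of the explicit-annotation fallback described above: the function names `apply_hamiltonian!` and `apply_all_bands!`, the diagonal "potential" standing in for the Hamiltonian, and the one-band-per-column layout are all illustrative assumptions, not the project's actual code.

```julia
using Base.Threads

# Hypothetical stand-in for applying the Hamiltonian to a single band:
# here just a multiplication by a diagonal "potential".
function apply_hamiltonian!(out::AbstractVector, potential::AbstractVector, band::AbstractVector)
    @. out = potential * band
    return out
end

# Apply the operator to every band (one band per column), with the loop
# over bands split across Julia threads.
function apply_all_bands!(Hpsi::AbstractMatrix, potential::AbstractVector, psi::AbstractMatrix)
    @threads for n in axes(psi, 2)
        apply_hamiltonian!(view(Hpsi, :, n), potential, view(psi, :, n))
    end
    return Hpsi
end

potential = collect(range(1.0, 2.0; length=1000))
psi = rand(ComplexF64, 1000, 8)   # 8 bands of length 1000
Hpsi = similar(psi)
apply_all_bands!(Hpsi, potential, psi)
```

The number of threads is fixed when Julia starts, for example with `JULIA_NUM_THREADS=4 julia`. Each iteration writes to its own column, so no locking is needed.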

@mfherbst mfherbst added the performance Performance regression or performance-related label Aug 5, 2019
@mfherbst
Member

mfherbst commented Aug 5, 2019

I agree the new partr framework seems to be the way people are heading, also for threading support in the lower-level libraries, and it only makes sense to follow along with it, especially since users of our code could do all sorts of things on top. Regarding @strided: I think that will really only be helpful in a few places (e.g. in the application of the non-local projectors) where a lot of classical array operations happen on all the bands at once. We'll have to benchmark, of course.

@antoine-levitt
Member Author

So, I did some very basic experiments. For a system with 400,000 plane waves, FFTW's own threading doesn't seem to do much: setting both FFTW and BLAS threads to the number of cores on my computer gave me a 20% speedup. So we should either do #9 or do our own threading.

@mfherbst
Member

mfherbst commented Dec 1, 2019

Hmm 20% is surprisingly little, but maybe I misunderstand what you did.

Could you perhaps commit a small benchmark script? I think it would be good to have a few "benchmark cases" or to integrate with https://github.com/JuliaCI/PkgBenchmark.jl so that one can track the performance better. What do you think?

@antoine-levitt
Member Author

That's set_num_threads for both FFTW and BLAS, set to the max number of cores vs 1. Benchmarking is easy: take any example and make it bigger (e.g. set a supercell). I don't think we need to set up performance tracking, because essentially the only thing that matters right now is how we do the FFTs and how many of them we do, which is simpler to track by hand. The most important things right now are the convergence criteria for the eigensolver (we do way too many iterations per SCF step; by comparison, abinit by default does 8 in the first two iterations, and then 4), and batching / threading FFTs.

@antoine-levitt
Member Author

@mfherbst can you try the following benchmarking script on the machine you have? https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f

Here I get

4 threads
N=128, M=40
Single FFT: no threads
  26.611 ms (0 allocations: 0 bytes)
Single FFT: threads
  15.158 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  1.080 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  631.769 ms (112 allocations: 8.06 KiB)
Multiple FFTs: auto, no threads
  1.083 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  696.880 ms (3281 allocations: 272.52 KiB)
Multiple FFTs: manual_threaded, threads
  679.694 ms (3323 allocations: 275.73 KiB)
Multiple FFTs: auto, threads
  633.797 ms (39 allocations: 3.33 KiB)

So the good news is that all methods of parallelization are essentially the same. The bad news is that they all suck :-) It looks like FFTs are almost memory-bound, and so do not benefit much from parallelization (at least on my machine). That's on Julia 1.3. I'd test on the lab's cluster, but I'm getting proxy errors...

@mfherbst
Member

mfherbst commented Dec 1, 2019

My machine (julia 1.3, fftw)

4 threads
N=128, M=40
Single FFT: no threads
  19.447 ms (0 allocations: 0 bytes)
Single FFT: threads
  9.175 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  792.030 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  373.974 ms (110 allocations: 8.03 KiB)
Multiple FFTs: auto, no threads
  792.610 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  391.993 ms (3243 allocations: 271.92 KiB)
Multiple FFTs: manual_threaded, threads
  377.248 ms (3318 allocations: 275.66 KiB)
Multiple FFTs: auto, threads
  375.433 ms (40 allocations: 3.34 KiB)

@mfherbst
Member

mfherbst commented Dec 1, 2019

Cluster08 (julia 1.2, MKL)

16 threads
N=128, M=40
Single FFT: no threads
  43.748 ms (0 allocations: 0 bytes)
Single FFT: threads
  6.878 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.774 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  418.949 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, no threads
  1.781 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  370.278 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

and (again 1.2, MKL)

4 threads
N=128, M=40
Single FFT: no threads
  39.283 ms (0 allocations: 0 bytes)
Single FFT: threads
  11.205 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.751 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  584.677 ms (111 allocations: 7.97 KiB)
Multiple FFTs: auto, no threads
  1.765 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  549.380 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  298.015 ms (107 allocations: 7.80 KiB)
Multiple FFTs: auto, threads
  496.712 ms (0 allocations: 0 bytes)

@antoine-levitt
Member Author

clustern20 (with julia 1.1, I can't make 1.3 work with the proxy for some reason):

16 threads
N=128, M=40
Single FFT: no threads
  32.266 ms (0 allocations: 0 bytes)
Single FFT: threads
  4.336 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.386 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  151.490 ms (53 allocations: 3.23 KiB)
Multiple FFTs: auto, no threads
  1.396 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  248.748 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  208.934 ms (56 allocations: 3.17 KiB)
Multiple FFTs: auto, threads
  143.142 ms (0 allocations: 0 bytes)
32 threads
N=128, M=40
Single FFT: no threads
  32.257 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.193 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.361 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.536 ms (23 allocations: 1.42 KiB)
Multiple FFTs: auto, no threads
  1.550 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  151.108 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  156.481 ms (24 allocations: 1.48 KiB)
Multiple FFTs: auto, threads
  150.511 ms (0 allocations: 0 bytes)

That's much better. I think that's consistent with FFTs being memory-limited, with memory bandwidth scaling differently on different machines.

Takeaways: oversubscription is fine, and FFTW's internal threading doesn't do better than outer threading. So my suggestion is to keep creating a plan for a single FFT, as we do now, but threaded (by calling FFTW.set_num_threads with the value of JULIA_NUM_THREADS), and to add our own threading on top of that. That was fine on 1.1 and should be even better on 1.3. Pity I can't test it on the cluster...
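
To make the machine-to-machine difference concrete, the speedups implied by the timings posted above can be computed directly. The numbers are copied from the benchmark outputs in this thread (the first 4-thread run and the 16-thread clustern20 run); the helper name `speedup` is only for illustration.

```julia
# Speedup = serial time / parallel time (both in ms, taken from the runs above).
speedup(serial_ms, parallel_ms) = serial_ms / parallel_ms

# First 4-thread run (Julia 1.3)
single_4 = speedup(26.611, 15.158)    # single FFT: no threads vs threads
batch_4  = speedup(1080.0, 631.769)   # multiple FFTs: manual vs manual_threaded

# clustern20, 16 threads (Julia 1.1)
single_16 = speedup(32.266, 4.336)    # single FFT: no threads vs threads
batch_16  = speedup(1386.0, 151.490)  # multiple FFTs: manual vs manual_threaded

results = [
    ("4 threads, single FFT",     single_4),
    ("4 threads, multiple FFTs",  batch_4),
    ("16 threads, single FFT",    single_16),
    ("16 threads, multiple FFTs", batch_16),
]

for (label, s) in results
    println(rpad(label, 28), round(s; digits=2), "x")
end
```

On these numbers the 4-thread machine gains a factor of roughly 1.7 to 1.8, while the 16-thread cluster node gains roughly 7 to 9.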

@mfherbst
Member

mfherbst commented Dec 1, 2019

Be careful with the 32 threads on cluster 20: it has hyperthreading enabled, so effectively it has only 16 cores.

@antoine-levitt
Member Author

Yeah I know, that was basically to test oversubscription

@mfherbst
Member

mfherbst commented Dec 1, 2019

Julia 1.3 has changed the way they update the registries in a way that it seems to ignore the proxy settings ... I've had the same issues.

@mfherbst
Member

mfherbst commented Dec 1, 2019

For FFTW I think you are right, but for MKL's FFT the picture seems to be different.

@antoine-levitt
Member Author

A bit, but maybe the results are too noisy. Can you run the 16 threads test again? I want to see if

Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

should be trusted or not.

@mfherbst
Member

mfherbst commented Dec 1, 2019

Another run:

Multiple FFTs: manual_threaded, threads
  312.535 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, threads
  261.955 ms (0 allocations: 0 bytes)

and yet one more:

Multiple FFTs: manual_threaded, threads
  368.037 ms (180 allocations: 15.16 KiB)
Multiple FFTs: auto, threads
  305.578 ms (0 allocations: 0 bytes)

and on another machine (cc09):

Multiple FFTs: manual_threaded, threads
  211.597 ms (173 allocations: 14.36 KiB)
Multiple FFTs: auto, threads
  147.225 ms (0 allocations: 0 bytes)

@mfherbst
Member

mfherbst commented Dec 1, 2019

The difference is similar in each case: roughly 50 to 60 ms.

@antoine-levitt
Member Author

Hm. So results are inconsistent, but always in the same direction. I'm tempted to ignore this... We really should see what it does with 1.3 (or even better, master). There are a few open issues on the Julia GitHub about proxies, and I posted in one, but proxies are a uniform pain.

@antoine-levitt
Member Author

But really, what this all shows is that a single FFT is already pretty well parallelized. This means that we can just ignore this and not do any threading at all (i.e. what we have now), and it'll be within a factor of 2 of the optimum (at least for these sizes). If we just add @threads to the for loop over the FFTs, we'll probably be optimal (or very close, especially with post-1.2 improvements to threading). Then we should run a large-ish computation on the cluster, see if new bottlenecks appear, and maybe add threading accordingly.
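
A rough sketch of what adding @threads to the loop over FFTs could look like, assuming FFTW.jl is installed. This is not the code from the gist or from the project: the function names `fft_all!` and `fft_all_threaded!`, the array layout (one 3D array per band), and the small sizes are assumptions chosen for illustration.

```julia
using FFTW            # assumes the FFTW.jl package is installed
using LinearAlgebra   # for mul!
using Base.Threads

N, M = 32, 8   # small illustrative sizes; the benchmarks above use N=128, M=40

inputs  = [rand(ComplexF64, N, N, N) for _ in 1:M]
outputs = [similar(x) for x in inputs]

# One plan, created once and reused for every array of this size.
plan = plan_fft(inputs[1])

# Serial loop over the arrays.
function fft_all!(outputs, plan, inputs)
    for i in eachindex(inputs)
        mul!(outputs[i], plan, inputs[i])
    end
    return outputs
end

# Same loop with the iterations distributed over Julia threads.
function fft_all_threaded!(outputs, plan, inputs)
    @threads for i in eachindex(inputs)
        mul!(outputs[i], plan, inputs[i])
    end
    return outputs
end

fft_all_threaded!(outputs, plan, inputs)
```

Sharing one plan across threads relies on each iteration reading and writing its own arrays; plan creation itself stays outside the threaded loop.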

@antoine-levitt
Member Author

For proxy issues, see Julia issue 33111; that fixed it for me.

@antoine-levitt
Member Author

So 1.3 improves the manual_threaded for me:

16 threads
N=128, M=40
Single FFT: no threads
  32.412 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.564 ms (298 allocations: 26.22 KiB)
Multiple FFTs: manual, no threads
  1.423 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  155.690 ms (194 allocations: 16.94 KiB)
Multiple FFTs: auto, no threads
  1.415 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  177.217 ms (12010 allocations: 1.03 MiB)
Multiple FFTs: manual_threaded, threads
  143.499 ms (12359 allocations: 1.04 MiB)
Multiple FFTs: auto, threads
  173.176 ms (453 allocations: 37.64 KiB)
32 threads
N=128, M=40
Single FFT: no threads
  34.014 ms (0 allocations: 0 bytes)
Single FFT: threads
  2.989 ms (588 allocations: 52.25 KiB)
Multiple FFTs: manual, no threads
  1.442 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.606 ms (306 allocations: 28.81 KiB)
Multiple FFTs: auto, no threads
  1.451 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  170.377 ms (23837 allocations: 2.05 MiB)
Multiple FFTs: manual_threaded, threads
  154.102 ms (24331 allocations: 2.08 MiB)
Multiple FFTs: auto, threads
  144.152 ms (622 allocations: 51.50 KiB)

Still a slight edge for auto FFTW with 32 threads, but that changes from benchmark to benchmark, and when I repeated it manual_threaded was faster. So let's go with #77 and not bother too much.

@mfherbst
Member

mfherbst commented Dec 1, 2019

I agree, especially since this keeps more control on our end and opens the way to integrating with future developments in Julia.

@antoine-levitt
Member Author

OK, let's close this one for now then. We can revisit according to profiling.

@antoine-levitt
Member Author

One thing is that FFTW defaults to no threading. Let's keep that manual for now, but note for later that we have to call FFTW.set_num_threads and BLAS.set_num_threads. Also, FFTW's thread count is fixed at plan creation.

@mfherbst
Member

mfherbst commented Dec 1, 2019

That is not true. For me it does.

@mfherbst
Member

mfherbst commented Dec 1, 2019

See https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L59. This is activated if nthreads() > 1 and I have by default export JULIA_NUM_THREADS=4, which I think is the way to go with this issue.

@antoine-levitt
Member Author

Oh, you're absolutely right, I stopped at https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L41. They're really confident oversubscription is not a problem then!
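
Whatever the default turns out to be, the thread counts can also be set explicitly. A minimal sketch, assuming FFTW.jl is installed; the choice of `Threads.nthreads()` as the count and the array size are illustrative assumptions. The thread count is set before the plan is created, since a plan keeps the count that was in effect at creation.

```julia
using LinearAlgebra   # provides BLAS.set_num_threads
using FFTW            # assumes the FFTW.jl package is installed

# Use as many library threads as Julia was started with (JULIA_NUM_THREADS).
n = Threads.nthreads()
BLAS.set_num_threads(n)
FFTW.set_num_threads(n)   # affects plans created after this call

x = rand(ComplexF64, 32, 32, 32)
plan = plan_fft(x)        # created with the thread count set above
y = plan * x
```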

@mfherbst
Member

mfherbst commented Dec 1, 2019

Indeed. I just saw that, too.
