Profiling PSF convolve #70
The issue seems to be that the pad-then-fftshift pattern becomes very inefficient for large images (memory allocations?). Assigning before shifting goes some way towards fixing the issue, but it still repeatedly allocates memory, so I have come up with custom pad_and_shift routines which reuse pre-initialised arrays (see the sketch below). This seems to have sped things up significantly. The c2r call still takes twice as long as r2c, but things seem more balanced now. This should be good enough for the time being, but I'll keep this one open because it's always nice to be able to go faster. |
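A minimal sketch of the two approaches (a hypothetical reconstruction; the original snippets are not preserved here, and the helper names plus the assumption that the padded size is at least twice the input size are mine):

```python
import numpy as np

def pad_and_shift_naive(x, Nx, Ny):
    # allocates two temporaries per call: one for np.pad, one for fftshift
    nx, ny = x.shape
    xpad = np.pad(x, ((0, Nx - nx), (0, Ny - ny)))
    return np.fft.fftshift(xpad)

def pad_and_shift_prealloc(x, out):
    # same result written into a reusable pre-initialised array;
    # valid when out is at least twice the size of x along each axis
    nx, ny = x.shape
    Nx, Ny = out.shape
    out[...] = 0.0
    out[Nx//2:Nx//2 + nx, Ny//2:Ny//2 + ny] = x
    return out
```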
As discussed, one more thing to try (and it does not look like numpy supports this) is to expressly ensure that the memory allocation starts at a multiple of the system page size -- that way you don't thrash the page cache as often by striding reads over a page boundary. If I recall correctly from my days doing this in C++, the solution is to over-allocate by at least a page worth of memory and then offset the start pointer to an address that is an integral multiple of the page size. This raw pointer can then be wrapped inside an ndarray structure for use in numpy. The alternative (which I have seen make a huge difference in terms of set and copy operations) is to switch the system to use hugepages -- then there is much less chance of striding over a page boundary. Also remember to rebuild numpy for the native instruction set to have any chance of vectorized instruction usage. Maybe @mreineck has some thoughts, or may have written an allocator to do page-aligned memory wrapping with numpy. |
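The over-allocate-and-offset trick can be done in pure numpy; a minimal sketch (the 4096-byte page size and the helper name are my assumptions):

```python
import numpy as np

PAGE = 4096  # typical page size; query it at runtime in real code

def page_aligned_empty(shape, dtype=np.float64):
    # Allocate an array whose data pointer lies on a page boundary by
    # over-allocating one extra page and slicing to an aligned offset.
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = np.empty(nbytes + PAGE, dtype=np.uint8)
    ofs = (-buf.ctypes.data) % PAGE  # distance to the next page boundary
    return buf[ofs:ofs + nbytes].view(dtype).reshape(shape)
```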
Thanks @bennahugo. I will give that a go. I see there is a ducc0.misc.make_noncritical which may be related |
In this context (heavy accesses to large arrays) I typically see two big sources of slowdown: repeated allocation/deallocation of large temporary arrays, and memory accesses with a critical stride (e.g. 4096 bytes on many systems).
I'm not aware of problems that can happen when memory accesses cross page boundaries ... for large arrays they do that all the time, no matter whether the array starts on a page boundary or not. Can you give me a pointer to a discussion of this effect? |
Thanks @mreineck -- I'm thinking of expressly avoiding too many strided accesses on large arrays. It is good to know that you have an allocator for this; maybe @landmanbester should try it upfront before any of the more involved fallback steps are invoked. Interesting -- I wonder if the hugepage setting has an influence on how often the temporary arrays are given back; writing things in numpy usually has the negative effect of not doing things in place. |
I think this is entirely controlled by environment variables for the allocator in use (see https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html for glibc, but you may have to use something different depending on your compiler). |
That is not always possible. Think about FFTing the leading dimension of a 1024x1024 array ... unless you do an array transposition (which has its own complications), there will be tons of really nasty strided memory accesses during the operation. These accesses make the FFT over the leading dimension much slower than one over the last dimension:
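A quick way to see the effect (an illustrative sketch, not the original demo):

```python
import timeit
import numpy as np

a = np.random.rand(1024, 1024) + 1j * np.random.rand(1024, 1024)

# axis 0: consecutive elements of each transform are 1024 entries apart
t0 = timeit.timeit(lambda: np.fft.fft(a, axis=0), number=20)
# axis 1: consecutive elements are adjacent in memory
t1 = timeit.timeit(lambda: np.fft.fft(a, axis=1), number=20)
print(f"leading axis: {t0:.3f}s, last axis: {t1:.3f}s")
```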
|
One last thing: the stride of 4096 I mentioned above has nothing to do with the system's memory page size; the numbers are only equal by pure coincidence. Critical stride depends on cache size, cache line length and cache associativity. Unfortunately there is not much documentation available online, but a quick demo is given at http://blog.cs4u.us/2015/07/observing-cache-associativity-effects.html. |
Thanks for the input @mreineck (and apologies for my tardy reply; I thought I posted this message a while ago, but I must have forgotten to hit the comment button). I definitely see more consistent timings with make_noncritical. I do, however, observe that c2r takes about double the time of r2c. Is this normal? |
If you do a c2r, a temporary copy of the input array currently has to be made internally (otherwise the input would be overwritten during the transform), and that extra copy may well explain the factor of two relative to r2c. I'll think about this again ... maybe there is a way around this temporary array now (the FFT library has been quite extensively reworked over the last years). |
You could do the following as an experiment: instead of a plain two-axis c2r call like
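(a hypothetical stand-in for the call being discussed; the array names, sizes, and keyword choices are placeholders)

```python
import numpy as np
import ducc0

nx = ny = 2048
nthreads = 8
xhat = np.empty((nx, ny // 2 + 1), dtype=np.complex128)  # filled by r2c earlier
x = np.empty((nx, ny), dtype=np.float64)

x = ducc0.fft.c2r(xhat, lastsize=ny, forward=False, inorm=2,
                  out=x, nthreads=nthreads)
```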
you could try something along the lines of
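(again a sketch, not the original snippet; splitting off the leading-axis transform as an in-place c2c, so that the remaining single-axis c2r needs no large temporary, is one plausible version of the experiment)

```python
# transform over the leading axis in place (ducc0's c2c allows out == input)
ducc0.fft.c2c(xhat, axes=(0,), forward=False, inorm=2, out=xhat,
              nthreads=nthreads)
# only the last axis then needs the complex-to-real step
x = ducc0.fft.c2r(xhat, axes=(1,), lastsize=ny, forward=False, inorm=2,
                  out=x, nthreads=nthreads)
```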
If that improves performance, then the bottleneck is indeed the allocation of the temporary inside c2r. |
... and even in the second profile, padding and shifting are still taking way too much time; almost as long as an FFT. I'm sure this can be improved, and I'm thinking of adding this sort of functionality to ducc0. |
Yeah, this totally surprised me. It's a bit of a computationally awkward thing to do, right? Here is my hacky attempt with numba (sketched below). Pre-allocating and making things non-critical definitely helps a lot, but I didn't expect it to take as much time as it does (this operation is linear in the image size, so in theory it should be much cheaper than the FFTs). I'm not sure where the time goes; I'll have to dig a bit deeper. I have been thinking about how cool it would be to be able to call ducc functions from within numba (this is related), but I think that may be non-trivial (pinging @sjperkins as he had some ideas on the topic). It would certainly be cool if you could do the padding and shifting inside ducc, but I also understand how this kind of complexity doesn't really fit well with the keep-it-simple philosophy. I did however note the convolve_axis function, and I am wondering, since we are convolving with the PSF all the time, whether it would make sense to have a 2D implementation of it (I don't think the Fourier-transformed PSF always has an outer-product representation). As always, open to suggestions and very willing to test whatever you come up with. |
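A sketch of the kind of numba kernel referred to above (a hypothetical reconstruction: it assumes even image dimensions and a pre-allocated output at least twice the input size, and it moves the PSF centre (nx//2, ny//2) to pixel (0, 0) while zero-padding):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, cache=True)
def pad_and_shift(x, out):
    # write each input pixel directly to its padded-and-shifted position,
    # four quadrants at a time (8 linear memory streams: 4 reads, 4 writes)
    nx, ny = x.shape
    Nx, Ny = out.shape
    hx, hy = nx // 2, ny // 2
    out[:, :] = 0.0
    for i in prange(hx):
        for j in range(hy):
            out[Nx - hx + i, Ny - hy + j] = x[i, j]
            out[Nx - hx + i, j] = x[i, hy + j]
            out[i, Ny - hy + j] = x[hx + i, j]
            out[i, j] = x[hx + i, hy + j]
    return out
```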
In principle it should not be too hard, but there are many details that can go wrong efficiency-wise. Your Numba implementation looks like it should be faster than what you measure; no idea what the JIT compiler is doing there.
I don't think that would be an issue. These would still be standalone functions, doing a small and well-defined job, so it doesn't really clash with the simplicity goal. |
convolve_axis is specifically designed to work with 1D kernels, and practically all of its advantages would go away if it were generalized to a non-separable multi-D kernel. In such a case I think that standard convolution via FFT is still the best thing to do, perhaps with some overlap-and-add trickery if the kernel is really small. |
I'm wondering: if this is about convolutions of real-valued arrays, perhaps you can avoid the entire r2c/c2r business by using Hartley transforms instead? |
I did wonder at some point if there would be anything to gain from using Hartley instead of Fourier transforms. I think the slight difference in the convolution formula put me off too quickly. I will look into this again. Thanks @mreineck |
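For reference, the "slight difference" is the reversed-index term in the Hartley convolution theorem; a 1D numpy sketch (my own illustration, not code from the thread):

```python
import numpy as np

def dht(a):
    # discrete Hartley transform via the FFT: H = Re(F) - Im(F)
    F = np.fft.fft(a)
    return F.real - F.imag

def hartley_convolve(x, y):
    # circular convolution via the Hartley convolution theorem:
    #   Z[k] = (X[k](Y[k] + Y[-k]) + X[-k](Y[k] - Y[-k])) / 2,
    # where the index -k is taken modulo N
    X, Y = dht(x), dht(y)
    Xr = np.roll(X[::-1], 1)  # X[-k]
    Yr = np.roll(Y[::-1], 1)  # Y[-k]
    Z = 0.5 * (X * (Y + Yr) + Xr * (Y - Yr))
    return dht(Z) / len(x)  # the DHT is its own inverse up to a factor 1/N
```

For an even (symmetric) kernel the reversed terms drop out and Z reduces to X*Y, so a symmetric PSF would make the Hartley route especially attractive.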
There's one more option: I can introduce a variant of c2r that is allowed to overwrite its input array; that would make the internal temporary copy unnecessary. |
Performant Reads/Writes

I agree that if you want very good performance, things like page and cache sizes start to matter. This means reading and writing chunks of consecutive data, probably trying to do a page of reads + writes at a time. See e.g. the efficient memory transpose strategies in CUDA. Also note how averaging is performed in steps of 128 baselines in the katdal averaging code. It looks like your numba implementation is making 4 random (non-consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes. This is also going to depend on whether that is possible with an fftshift. @mreineck, please correct me here, as you've probably done a lot of this. Perhaps we can adapt some of ducc0's C++ code into numba?

Calling ducc0 code from numba

Re: calling the ducc0 code from numba, I'm aware of two strategies: calling into a compiled wrapper through ctypes/cffi function pointers, or generating bindings with something like cython.
Thoughts
To me, it feels like the numba route is the direction forward, but that depends on whether padding+fftshifting can be done with efficient read/write access patterns, especially if they can be replicated from existing ducc0 code? |
The way I am currently doing the convolution (see here) pre-allocates that array. I am not sure what the consequences would be if I allow r2c to create a new array. Maybe it would just move the problem elsewhere? I think I will start by investigating the Hartley transform route |
I think the code is OK as it is: there are 8 data "streams" (4 reading, 4 writing) that each access memory in a linear fashion. The memory system should be able to handle that, as long as the offsets between the individual streams are not critical strides (and I think we made sure that they aren't). It's of course still possible that Numba will do much better on individual loops, each with a single input and output stream. Concerning calling ducc from numba: I don't really see where this would help much. Ducc routines typically do fairly big chunks of work in a single call, and as long as this is the case, there should be no need to call them from micro-optimized, precompiled code, right? Please let's not get cython or ctypes into the mix ... I'm more than happy that pybind11 takes care of all the interfacing.
fftshift has no bad memory access patterns. When array dimensions are even, the operation is completely trivial; when they are odd, it's a bit more involved but still doable. When padding and shifting happen at the same time, things actually become easier, since this is by definition an out-of-place operation.
Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.
That's correct. I prefer having separate functions doing small tasks to having huge monolithic functions with dozens of options that can do everything. It's not always avoidable, but in this case I think it is.
Yes, I'm pretty sure. |
I think the new approach should work perfectly fine in this context. The only difference would be that the contents of the input array are destroyed by the call. |
Thanks for the input @sjperkins. I certainly don't want to be restricted to binary wheels; I actually thought that jitting would bypass this issue. I see pretty major speedups building ducc from source on certain architectures. The other obvious reason for wanting ducc functionality within numba is to avoid the python interpreter. For example, one reason for sticking with numpy's FFTs instead of scipy's (see ratt-ru/codex-africanus#211) is that np.fft is available in numba. However, in most cases where I can only partially jit a function, I do quite a bit of work between interpreter visits, so it may not help much. Cool as it may be to have that functionality, it seems non-trivial and may be a bit off-topic for the current discussion. |
Eager to see what you come up with, thanks.
Yeah, that would be perfectly fine here. As long as the array survives, I don't think its content after the call matters. |
I just want to check: would it not be better to read in a cache line of, say, 128 bytes with spatial locality in consecutive operations than to issue 4 streams of operations, even if each stream has spatial locality?
I'm thinking of reading in a cache line of 128 bytes and writing it out in 128-byte cache lines, for example. This is important for CUDA and, from what I've seen, for general CPUs too, but you'll have more XP on this front.
Very much agree :-). I also love pybind11 but haven't found an excuse to use it.... thus far.
Without deeply understanding what's involved here, I suspect that doing something optimal w.r.t. cache lines and page sizes might be possible in numba.
Ok, I'm sure something will reveal itself after stewing in the unconscious :-)
I absolutely understand that in the context of C++. We've been spoilt by numba, which allows us to concoct unholy things like the Monolithic RIME. This is only possible in the context of a dynamic language like Python.
This also sounds like another possible avenue? To me it seems like there are two pragmatic options: pad+fftshift in numba or ducc0. |
Cool. Note there's also the possibility of building ducc0 and the wrapper during a source installation -- you'd just have to insist on having a C++ compiler installed, which some purists may balk at. |
That's exactly what happens (the array survives, but its content is modified). I have an implementation ready, and it should be available with the next release. |
I'm pretty sure that on the CPU there is no advantage in reading the entire line in one go. The only important thing is that the line is fully read before it is thrown out of the cache by something else. If you have very many streams (say, beyond 20 or so), there may be unrelated problems that slow things down ... for example, the CPU will run out of integer registers to store all the stream pointers. |
How about
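(a signature sketch only; the exact parameter names are guesses)

```python
ducc0.misc.roll_resize_roll(inp, out, ofs_inp, ofs_out)
```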
? This behaves as if it first cyclically rolls the input by ofs_inp along each axis, then zero-pads (or crops) it to the shape of out, and finally cyclically rolls the result by ofs_out.
In reality it just does one complicated set of nested loops, possibly parallelized, and touches every array entry only once. |
I have a candidate at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/tree/roll_experiments; the function is called ducc0.misc.roll_resize_roll. If it's not too hard to get this branch into your environment, please give it a try! |
This version also has the improved c2r that is allowed to overwrite its input. |
Thanks, will do. Sorry for the delay; I've been moving over the weekend, but I'll get to it first thing tomorrow morning.
|
I can confirm that c2r and r2c give near identical timings when the input is allowed to be overwritten. |
Sorry for this flashback, but I just noticed this comment from further above: "Also remember to rebuild numpy for the native instruction set to have any chance of vectorized instruction usage."
@bennahugo are you really running a numpy build compiled for the native instruction set? |
I managed to replace my clunky padding and shifting functions with ducc0.misc.roll_resize_roll. |
I'm on numpy 1.22.0 |
@landmanbester, I don't think the amount of shift should depend on the differences between the sizes, but only on the size of the array where you do the shifting. When you call fftshift on an array of size N, the shift is N//2, regardless of how much padding was applied. |
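(a quick numpy illustration of that point)

```python
import numpy as np

print(np.fft.fftshift(np.arange(6)))  # [3 4 5 0 1 2] -- shift by 6//2 = 3
print(np.fft.fftshift(np.arange(5)))  # [3 4 0 1 2]   -- shift by 5//2 = 2
```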
Looking at the code, I wonder whether the shift is actually needed there at all. |
Ah, I have been really silly. I think the shift operation is not even needed here, only the padding. The only thing that needs the shift is the convolution kernel, because I need to get the orientation of the PSF with respect to the dirty image right. The following seems to do exactly the same as the previous version of the function
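(a hypothetical sketch of such a pad-only convolution; the names, the pre-allocated work arrays, and the exact ducc0 keyword choices are placeholders rather than the code from this thread)

```python
import numpy as np
import ducc0

def psf_convolve(x, psfhat, xpad, xhat, nthreads=1):
    # zero-pad x into the pre-allocated xpad; no fftshift is needed here,
    # because the shift is already baked into psfhat (the transform of the
    # kernel with its peak rolled to pixel (0, 0))
    nx, ny = x.shape
    xpad[...] = 0.0
    xpad[:nx, :ny] = x
    ducc0.fft.r2c(xpad, out=xhat, forward=True, nthreads=nthreads)
    xhat *= psfhat
    ducc0.fft.c2r(xhat, lastsize=xpad.shape[-1], out=xpad, forward=False,
                  inorm=2, allow_overwriting_input=True, nthreads=nthreads)
    return xpad[:nx, :ny].copy()
```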
only now psf_convolve_cube is totally dominated by the FFTs (the two blocks under r2c and c2r in the profile). I still don't fully understand where the additional non-FFT time goes. |
Great, that looks much better!
Copies, initialization of the padded arrays, that sort of thing, most likely. |
Hmmm ... you could try to replace the explicit zeroing-plus-assignment by a single roll_resize_roll call (with zero shifts), and likewise the final slice-and-copy. It might help a bit, but I don't know. |
Yep, I tried that. It makes almost no difference. Pretty sure the PSF convolution was the original issue, and since that's fairly well optimized now, I will close this one.
My clean implementation seems a bit slow when image sizes get large. I suspect that something weird happens when I go past a certain size (maybe cache-size related?). This is an attempt to figure out where the time is spent. Here are the most relevant entries from cProfile.
Interestingly, c2r (the complex-to-real FFT) seems to take about twice as long as r2c (the real-to-complex FFT). Also, a disproportionate amount of time is spent in the fftshift and padding routines; I am not sure what to expect from them. I will try a smaller image and see whether the percentage of time spent in the different routines stays the same.