Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiling PSF convolve #70

Closed
landmanbester opened this issue Dec 5, 2022 · 41 comments
Closed

Profiling PSF convolve #70

landmanbester opened this issue Dec 5, 2022 · 41 comments

Comments

@landmanbester
Copy link
Collaborator

My clean implementation seems a bit slow when image sizes get large. I suspect that something weird happens when I go past a certain size (maybe cache size related?). This is an attempt to figure out where the time is spent. Here are the most relevant entries from cprofile

         256933 function calls (252261 primitive calls) in 1251.271 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  119.323  119.323 1251.271 1251.271 clark.py:77(clark)
        7  331.328   47.333 1056.260  150.894 psf.py:154(_hessian_reg_psf)
    97/48    2.381    0.025  462.835    9.642 {built-in method numpy.core._multiarray_umath.implement_array_function}
       14    0.000    0.000  283.492   20.249 <__array_function__ internals>:177(roll)
       14  283.491   20.249  283.492   20.249 numeric.py:1140(roll)
        7  227.381   32.483  227.381   32.483 {built-in method ducc0.fft.c2r}
        7    0.000    0.000  141.939   20.277 <__array_function__ internals>:177(ifftshift)
        7    0.000    0.000  141.939   20.277 helper.py:76(ifftshift)
        7    0.014    0.002  141.587   20.227 <__array_function__ internals>:177(fftshift)
        7    0.002    0.000  141.572   20.225 helper.py:19(fftshift)
        7    0.000    0.000  119.148   17.021 <__array_function__ internals>:177(pad)
        7    0.023    0.003  119.147   17.021 arraypad.py:529(pad)
        7   94.833   13.548   94.833   13.548 {built-in method ducc0.fft.r2c}
       21   87.345    4.159   87.345    4.159 arraypad.py:129(_set_pad_area)
       29   57.319    1.977   57.319    1.977 {method 'reduce' of 'numpy.ufunc' objects}
       19    0.001    0.000   57.206    3.011 fromnumeric.py:69(_wrapreduction)
        2    0.000    0.000   38.024   19.012 <__array_function__ internals>:177(amax)
        2    0.000    0.000   38.024   19.012 fromnumeric.py:2675(amax)
        7   31.754    4.536   31.776    4.539 arraypad.py:86(_pad_simple)
        9    0.000    0.000   17.192    1.910 <__array_function__ internals>:177(sum)
        9    0.000    0.000   17.177    1.909 fromnumeric.py:2160(sum)
        7    6.910    0.987    6.910    0.987 clark.py:26(subminor)
        1    5.602    5.602    5.602    5.602 {method 'copy' of 'numpy.ndarray' objects}
        7    0.000    0.000    2.351    0.336 <__array_function__ internals>:177(where)
        1    0.000    0.000    2.167    2.167 dispatcher.py:388(_compile_for_args)
        8    0.000    0.000    2.018    0.252 <__array_function__ internals>:177(any)
        8    0.000    0.000    2.005    0.251 fromnumeric.py:2305(any)
        1    0.000    0.000    1.475    1.475 dispatcher.py:915(compile)
        1    0.000    0.000    1.474    1.474 caching.py:627(load_overload)

Interestingly, c2r (complex to real FFT) seems to take about twice as long as r2c (the real to complex FFT). Also, a disproportionate amount of time is spent in the fftshift and padding routines. I am not sure what to expect from them. I will try on a smaller image and see if the percentage time spent in the different routines remains the same

@landmanbester
Copy link
Collaborator Author

The issue seems to be that the following pattern

y = np.fft.fftshift(np.pad(x, pad_width=10000, mode='constant'), axes=(0,1))

becomes very inefficient for large images (memory allocations?). Assigning before shifting i.e.

xpad = np.pad(x, pad_width=10000, mode='constant')
y = np.fft.fftshift(xpad, axes=(0,1))

goes some way towards fixing the issue but this still repeatedly allocates memory so I have come up with custom pad_and_shift routines which reuses pre-initialised arrays. This seems to have sped things up significantly.

image

The c2r call still takes twice as long as r2c but things seem more balanced now. This should be good enough for the time being but I'll keep this one open because it's always nice to to go faster.

@bennahugo
Copy link
Collaborator

bennahugo commented Dec 7, 2022

As discussed one more thing to try (and it does not look like numpy support this) is to expressly ensure that the memory allocation starts at a multiple of the system page size - that way you don't thrash the page cache that often by striding reads over the page boundary. If I recall correctly now from my days doing this in C++ the solution is to over-allocate by at least a page worth of memory and then offset the start pointer to an address that is an integral multiple of the page boundary. This raw pointer can then be wrapped inside an ndarray structure for use in numpy.

The alternative (which I have seen makes a huge difference in terms of set and copy operations) is to switch the system to use hugepages -- then there is much less chance of striding the page boundary.
image
The change can be made permanent in grub, but to get it temporarily until next reboot do:
echo always > /sys/kernel/mm/transparent_hugepage/enabled

Also remember to rebuild numpy to use native instruction set to have any chance of vectorized instruction set usage:
CFLAGS="-O3 -march=native" pip install --no-binary :all: numpy==1.13.3 #same as system

maybe @mreineck has some thoughts or may have written an allocator to do page aligned memory wrapping with numpy

@landmanbester
Copy link
Collaborator Author

Thanks @bennahugo. I will give that a go. I see there is a ducc0.misc.make_noncritical which may be related

@mreineck
Copy link

mreineck commented Dec 8, 2022

In this context (heavy accesses to large arrays) I typically see two big sources of slowdown:

  • cache thrashing caused by accesses with critical stride. This happens when successive accesses are located at addresses that are a multiple of (usually) 4096 bytes apart, usually in multi-D FFTs on arrays with large power-of-2 sizes. To avoid this, I have written the make_noncritical method which simply allocates arrays that are slightly too large if the strides would be problematic otherwise.
  • frequent memory transfers between the application and the OS. This happens when many large temporary arrays are created and destroyed during a calculation and the memory is passed back to the OS in between. This can be avoided by increasing the threshold value for giving back the memory; I usually do an export MALLOC_TOP_PAD_=8000000000 and see significant speedups from that. (Note the trailing underscore!)

I'm not aware of problems that can happen when memory accesses cross page boundaries ... for large arrays they do that all the time, no matter whether the array starts on a page boundary or not. Can you give me a pointer to a discussion of this effect?

@bennahugo
Copy link
Collaborator

Thanks @mreineck -- I'm thinking of expressly avoiding too many strided accesses on large arrays. That is good to know you have an allocator for this. Maybe @landmanbester should try this upfront before any of the necessary backward steps are invoked. Interesting, I wonder if the hugepage setting has an influence on how often the temporary arrays are given back -- usually writing things in numpy does have the negative effect of not doing things in place.

@mreineck
Copy link

mreineck commented Dec 8, 2022

I wonder if the hugepage setting has an influence on how often the temporary arrays are given back

I think this is entirely controlled by environment variables for the allocator in use (see https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html for glibc, but you may have to use something dependig on your compiler).

@mreineck
Copy link

mreineck commented Dec 8, 2022

I'm thinking of expressly avoiding too many strided accesses on large arrays.

That is not always possible. Think about FFTing the leading dimension of a 1024x1024 array ... unless you do array transposition (which has its own complications), there will be tons of really nasty strided memory accesses during the operation. These accesses will make the FFT over the leading dimension much slower than over the last dimension:

In [1]: import numpy as np

In [2]: import ducc0

In [3]: a=np.zeros((1024,1024),dtype=np.complex128)

In [4]: %timeit ducc0.fft.c2c(a,out=a,axes=(0,))
7.04 ms ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit ducc0.fft.c2c(a,out=a,axes=(1,))
4.29 ms ± 4.59 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: a=ducc0.misc.make_noncritical(a)

In [7]: %timeit ducc0.fft.c2c(a,out=a,axes=(0,))
4.01 ms ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit ducc0.fft.c2c(a,out=a,axes=(1,))
4.24 ms ± 3.08 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mreineck
Copy link

mreineck commented Dec 8, 2022

One last thing: the stride of 4096 I mentioned above has nothing to do with the system's memory page size; the numbers are only equal by pure coincidence. Critical stride depends on cache size, cache line length and cache associativity. Unfortunately there is not much documentation available online, but a quick demo is given at http://blog.cs4u.us/2015/07/observing-cache-associativity-effects.html.

@landmanbester
Copy link
Collaborator Author

Thanks for the input @mreineck (and apologies for my tardy reply I thought I posted this message a while ago but I must have forgotten to hit the comment button). I definitely see more consistent timings with make_noncritical. I do however observe that c2r takes about double the time of r2c. Is this normal?

@mreineck
Copy link

mreineck commented Jan 5, 2023

If you do a c2r transform that affects more than one axis, then it will be somewhat slower than the analogous r2c one, since it has to allocate a temporary array for storing intermediate results. I'm not sure how much slowdown this causes, but a factor of 2 feels somewhat high.

I'll think about this again ... maybe there is a way around this temporary array now (the FFT library has been quite extensively reworked over the last years).

@mreineck
Copy link

mreineck commented Jan 5, 2023

You could do the following as an experiment: instead of

c2r(a, axes, out=b)

you could try something along the lines of

tmp = ducc0.misc.make_noncritical(np.empty_like(b))  # do this somewhere early, not for every transform
c2c(a, axes[:-1], out=tmp)
c2r(tmp, axes[-1], out=b)

If that improves performance, then the bottleneck is indeed the allocation of the temporary inside c2r.

@mreineck
Copy link

mreineck commented Jan 5, 2023

... and even in the second profile, padding and shifting is still taking way too much time; almost as long as an FFT. I'm sure this can be improved, and I'm thinking of adding this sort of functionality to ducc, but it will take a bit.

@landmanbester
Copy link
Collaborator Author

Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right? Here is my hacky attempt with numba

https://github.com/ratt-ru/pfb-clean/blob/ae78b8ca546fc78447473f888cb1334cff506e43/pfb/utils/misc.py#L1108

Pre-allocating and making things non-critical definitely helps a lot but I didn't expect it to take as much time as it does (theoretically this operation should have less than linear complexity). I'm not sure where the time goes, I'll have to dig a bit deeper. I have been thinking about how cool it would be to be able to call ducc functions from within numba (this is related) but I think that may be non-trivial (pinging @sjperkins as he had some ideas on the topic). It would certainly be cool if you could do the padding and shifting inside ducc but I also understand how this kind of complexity doesn't really fit well with the keep it simple philosophy. I did however note the convolve_axis function and I am wondering, since we are convolving with the PSF all the time, if it may make sense to have a 2D implementation of that (I don't think the Fourier transformed PSF always has an outer product representation). As always open to suggestions and very willing to test whatever you come up with

@mreineck
Copy link

mreineck commented Jan 6, 2023

Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right?

In principle it should not be too hard, but there are many details that can go wrong efficiency-wise. Your Numba implementation looks like it should be faster than what you measure; no idea what the JIT compiler is doing there.

It would certainly be cool if you could do the padding and shifting inside ducc but I also understand how this kind of complexity doesn't really fit well with the keep it simple philosophy.

I don't think that would be an issue. This would still be standalone functions, doing a small and well-defined job, so it doesn't really clash with the simplicity goal.

I did however note the convolve_axis function and I am wondering, since we are convolving with the PSF all the time, if it may make sense to have a 2D implementation of that (I don't think the Fourier transformed PSF always has an outer product representation).

Convolve_axis is specifically designed to work with 1D kernels, and practically all of its advantages would go away if it is generalized to a non-separable multi-D kernel. In such a case I think that standard convolution via FFT is still the best thing to do; perhaps with some overlap-and-add trickery, if the kernel is really small.

@mreineck
Copy link

mreineck commented Jan 6, 2023

I'm wondering: if this is about convolutions of real-valued arrays, perhaps you can avoid the entire r2c/c2r issue by just using Hartley transforms instead. Then you have real arrays even in the Fourier domain and can do more oerations in-place. I haven't thought about it much yet, but replacing both r2c and c2r with r2r_genuine_hartley could "just work" (and if that works, we can perhaps even go to r2r_separable_hartley which might be even faster, but that requires some extra tricks).

@landmanbester
Copy link
Collaborator Author

I did wonder at some point if there would be anything to gain from using Hartley instead of Fourier transforms. I think the slight difference in the convolution formula put me off too quickly. I will look into this again. Thanks @mreineck

@mreineck
Copy link

mreineck commented Jan 6, 2023

There's one more option: I can introduce a variant of c2r which is allowed to overwrite its input array with the intermediate result; that would also eliminate the temporary variable.

@sjperkins
Copy link
Member

Yeah this totally surprised me. It's a bit of a computationally awkward thing to do right? Here is my hacky attempt with numba

https://github.com/ratt-ru/pfb-clean/blob/ae78b8ca546fc78447473f888cb1334cff506e43/pfb/utils/misc.py#L1108

Pre-allocating and making things non-critical definitely helps a lot but I didn't expect it to take as much time as it does (theoretically this operation should have less than linear complexity). I'm not sure where the time goes, I'll have to dig a bit deeper. I have been thinking about how cool it would be to be able to call ducc functions from within numba (this is related) but I think that may be non-trivial (pinging @sjperkins as he had some ideas on the topic). It would certainly be cool if you could do the padding and shifting inside ducc but I also understand how this kind of complexity doesn't really fit well with the keep it simple philosophy. I did however note the convolve_axis function and I am wondering, since we are convolving with the PSF all the time, if it may make sense to have a 2D implementation of that (I don't think the Fourier transformed PSF always has an outer product representation). As always open to suggestions and very willing to test whatever you come up with

Performant Reads/Writes

I agree that if you want very good performance, things like page and cache sizes start to matter. This means reading and writing chunks of consecutive data, probably trying to do a page of reads + writes at a time. See for e.g. efficient memory transpose strategies in CUDA. Also note how averaging is performed in steps of 128 baselines in katdal averaging code.

It looks like your numba implementation is making 4 random (non consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes: This is also going to be dependent on whether this is possible with an fftshift. @mreineck, please correct me here as you've probably done a lot of this. Perhaps we can adapt some of ducc0's C++ code into numba?

Calling ducc0 code from numba

re: calling the ducc0 code from numba, I'm aware of two strategies:

  1. https://numba.readthedocs.io/en/stable/extending/high-level.html#importing-cython-functions would involve writing a cython module that wraps the ducc0 C++ code. AFAIK pfb-clean will then need to be packaged as a binary wheel containing this compiled wrapper.

  2. https://labs.quansight.org/blog/2021/10/cxx-numba-interoperability. Quoting from the blog post

    We provide here a Python script cxx2py.py that auto-generates, from a user-supplied C++ header and source files, the C/C++ wrapper library as well as a Python ctypes wrapper module. The Python module contains ctypes definitions of C++ library functions that Numba compiled functions are able to call directly without requiring the expensive and redundant object transformations mentioned above. An alternative to using ctypes in the Python wrapper module would be to use the Numba Wrapper Address Protocol - WAP . Its usage should be considered if the ctypes "C++ expressiveness" turns out to be insufficient.

    Cons:

    • I suspect it doesn't work with templated code, which from ducc0 produces from memory (@mreineck correct if I'm wrong).
    • The cxx2py.py tool uses CLang to parse C++ header files and to build the C/C++ wrapper shared library. It also uses the RBC - Remote Backend Compiler to parse the signatures of C++ functions for convenience.
    • Thus clang will become a pfb-clean dependency.
    • As it also produces a shared wrapper library, pfb-clean will need to produce binary wheels.

Thoughts

  1. I think there's probably a lot that could be gained from doing the padding and shifting in numba with cache/page efficient access. And then possibly doing it multiple threads. This view might be incorrect if the fftshift demands a random to random access pattern. @landmanbester + @mreineck please tell me if I'm wrong here.
  2. Calling ducc0 from numba seems to involve building another C/C++ shared library to do the padding+shifting and wrapping ducc0, introducing dependencies on pfb-clean, as well as probably a binary wheel.
  3. I'm missing a lot of context here, but I'm sensing that @mreineck would prefer not to support the padding+fftshift op in ducc0. Or perhaps he'd rather not support a pad+fftshift+grid op? Would a pad+fftshift op producing data that can be fed into a grid op be palatable?

To me, it feels like (1) is the direction forward but that depends on whether padding+fftshifting can be done with efficient read/write access patterns, especially if they can be replicated from existing ducc0 code?

@landmanbester
Copy link
Collaborator Author

There's one more option: I can introduce a variant of c2r which is allowed to overwrite its input array with the intermediate result; that would also eliminate the temporary variable.

The way I am currently doing the convolution (see here) pre-allocates that array. I am not sure what the consequences would be if I allow r2c to create a new array. Maybe it would just move the problem elsewhere? I think I will start by investigating the Hartley transform route

@mreineck
Copy link

mreineck commented Jan 6, 2023

It looks like your numba implementation is making 4 random (non consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes.

I think the code is OK as it is: there are 8 data "streams" (4 reading, 4 writing) that each access memory in a linear fashion. The memory system should be able to handle that, as long as the offsets between the individual streams are not critical strides (and I think we made sure that they aren't).

It's of course still possible that Numba will do much better on individual loops, each with a single input and output stream.

Concerning calling ducc from numba: I don't really see where this would help much. Ducc routines typically do fairly big chunks of work in a single call, and as long as this is the case, there should be no need to call them from micro-optimized, precompiled code, right?

Please let's not get cython or ctypes into the mix ... I'm more than happy that pybind11 allows me to avoid them ;-)

I think there's probably a lot that could be gained from doing the padding and shifting in numba with cache/page efficient access. And then possibly doing it multiple threads. This view might be incorrect if the fftshift demands a random to random access pattern. @landmanbester + @mreineck please tell me if I'm wrong here.

fftshift has no bad memory access patterns. When array dimensions are even, the operation is completely trivial; when they are odd, it's a bit more involved but still doable. When padding and shifting happens at the same time, things actually become easier, since this is by definition an out-of-place operation.

I'm missing a lot of context here, but I'm sensing that @mreineck would prefer not to support the padding+fftshift op in ducc0.

Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.

Or perhaps he'd rather not support a pad+fftshift+grid op?

That's correct. I prefer having separate functions doing small tasks to having huge monolithic functions with dozens of options that can do everything. It's not always avoidable, but in this case I think it is.

Would a pad+fftshift op producing data that can be fed into a grid op be palatable?

Yes, I'm pretty sure.

@mreineck
Copy link

mreineck commented Jan 6, 2023

The way I am currently doing the convolution (see here) pre-allocates that array. I am not sure what the consequences would be if I allow r2c to create a new array. Maybe it would just move the problem elsewhere? I think I will start by investigating the Hartley transform route

I think the new approach should work perfectly fine in this context. The only difference would be that xhat would get overwritten with garbage by the c2r call, and I don't think that would hurt anyone.

@landmanbester
Copy link
Collaborator Author

Thanks for the input @sjperkins. I certainly don't want to restrict to binary wheels. I actually thought that jitting would bypass this issue. I see pretty major speedups building ducc from source on certain architectures. The other obvious reason for wanting ducc functionality within numba is to avoid the python interpreter. For example, one reason for sticking with numpy's fft's instead of scipy's (see ratt-ru/codex-africanus#211) is because np.fft is available in numba. However, in most cases where I can only partially jit a function I do quite a bit of work between interpreter visits so it may not help much. Cool as it may be to have that functionality, it seems non-trivial and may be a bit off topic for the current discussion

@landmanbester
Copy link
Collaborator Author

Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.

Eager to see what you come up with, thanks.

I think the new approach should work perfectly fine in this context. The only difference would be that xhat would get overwritten with garbage by the c2r call, and I don't think that would hurt anyone.

Yeah that would be perfectly fine here. As long as the array survives I think its content after the c2r call is irrelevant

@sjperkins
Copy link
Member

It looks like your numba implementation is making 4 random (non consecutive in memory) read accesses and then performing 4 random write accesses, which probably won't be great for cache performance. So to make this faster, I think we'd have to re-arrange the loops, reads and writes.

I think the code is OK as it is: there are 8 data "stream" (4 reading, 4 writing) that each access memory in a linear fashion. The memory system should be able to handle that, as long as the offsets between the individual streams are not critical strides (and I think we made sure that they aren't).

I just want to check: would it not be better to read in a cache line of, say 128 bytes with spatial locality in consecutive operations than to issue 4 streams of operations, even if each stream has spatial locality?

It's of course still possible that Numba will do much better on individual loops, each with a single input and output stream.

I'm thinking of reading in a cache line of 128 bytes and writing them out in 128 bytes cache lines for e.g. This is important for CUDA and, from what I've seen general CPUs but you'll have more XP on this front.

Concerning calling ducc from numba: I don't really see where this would help much. Ducc routines typically do fairly big chunks of work in a single call, and as long as this is the case, there should be no need to call them from micro-optimized, precompiled code, right?

Please let's not get cython or ctypes into the mix ... I'm more than happy that pybind11 allows me to avoid them ;-)

Very much agree :-). I also love pybind11 but haven't found an excuse to use it.... thus far.

I think there's probably a lot that could be gained from doing the padding and shifting in numba with cache/page efficient access. And then possibly doing it multiple threads. This view might be incorrect if the fftshift demands a random to random access pattern. @landmanbester + @mreineck please tell me if I'm wrong here.

fftshift has no bad memory access patterns. When array dimensions are even, the operation is completely trivial; when they are odd, it's a bit more involved but still doable. When padding and shifting happens at the same time, things actually become easier, since this is by definition an out-of-place operation.

Without deeply understanding what's involved here, I suspect that doing something optimal w.r.t. to cache lines and page sizes might be possible in numba.

I'm missing a lot of context here, but I'm sensing that @mreineck would prefer not to support the padding+fftshift op in ducc0.

Actually I'm fine with having this in ducc0. It's just that I'm such a perfectionist that I'll need some time before I have found the "best" possible interface for it.

Ok, I'm sure something will reveal itself after stewing in the unconcious :-)

Or perhaps he'd rather not support a pad+fftshift+grid op?

That's correct. I prefer having separate functions doing small tasks to having huge monolithic functions with dozens of options that can do everything. It's not always avoidable, but in this case I think it is.

I absolutely understand that in the context. of C++. We've been spoilt by numba which allows us to concoct unholy things like the Monolithic RIME. This is only possible in the context of a dynamic language like Python.

Would a pad+fftshift op producing data that can be fed into a grid op be palatable?

Yes, I'm pretty sure.

This also sounds like another possible avenue? To me it seems like there are two pragmatic options: pad+fftshift in numba or ducc0.

@sjperkins
Copy link
Member

Thanks for the input @sjperkins. I certainly don't want to restrict to binary wheels. I actually thought that jitting would bypass this issue. I see pretty major speedups building ducc from source on certain architectures. The other obvious reason for wanting ducc functionality within numba is to avoid the python interpreter. For example, one reason for sticking with numpy's fft's instead of scipy's (see ratt-ru/codex-africanus#211) is because np.fft is available in numba. However, in most cases where I can only partially jit a function I do quite a bit of work between interpreter visits so it may not help much. Cool as it may be to have that functionality, it seems non-trivial and may be a bit off topic for the current discussion

Cool. Note there's also the possibility of building ducc0 and the wrapper during a source installation -- you'd just have to insist on having a C++ compiler installed, which some purists may balk at.

@mreineck
Copy link

mreineck commented Jan 6, 2023

Yeah that would be perfectly fine here. As long as the array survives I think its content after the c2r call is irrelevant

That's exactly what happens (the array survives,but its content is modified). I have an implementation ready, and it should be available with the next release.

@mreineck
Copy link

mreineck commented Jan 6, 2023

I just want to check: would it not be better to read in a cache line of, say 128 bytes with spatial locality in consecutive operations than to issue 4 streams of operations, even if each stream has spatial locality?

I'm pretty sure that on the CPU there is no advantage in reading the entire line in one go. The only important thing is that the line is fully read before it is thrown out of the cache by something else.

If you have very many streams (say, beyond 20 or so), there may be unrelated problems that slow things down ... for example, the CPU will run out of integer registers to store all the stream pointers.

@mreineck
Copy link

mreineck commented Jan 6, 2023

How about

def roll_resize_roll(in, r1, r2, out)

? This behaves as if it

  • rolls in by r1, putting the result into tmp
  • copies tmp into out, padding/truncating it as necessary
  • rolls out in-place by r2

In reality it just does one complicated set of nested loops, possibly parallelized, and touches every array entry exactly at most once. I think this should cover all use cases you need, @landmanbester.

@mreineck
Copy link

mreineck commented Jan 9, 2023

I have a candidate at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/tree/roll_experiments; the function is called ducc0.misc.roll_resize_roll. If it's not too hard to get this branch into your environment, please give it a try!

@mreineck
Copy link

mreineck commented Jan 9, 2023

This version also has the improved c2r function with the allow_overwriting_input flag.

@landmanbester
Copy link
Collaborator Author

landmanbester commented Jan 9, 2023 via email

@landmanbester
Copy link
Collaborator Author

This version also has the improved c2r function with the allow_overwriting_input flag.

I can confirm that c2r and r2c give near identical timings when the allow_overwriting_input flag is set. Thanks @mreineck

@mreineck
Copy link

Sorry for this flashback, but I just noticed this comment from further above:

Also remember to rebuild numpy to use native instruction set to have any chance of vectorized instruction set usage:
CFLAGS="-O3 -march=native" pip install --no-binary :all: numpy==1.13.3 #same as system

@bennahugo are you really running numpy 1.13 on that machine? I wouldn't touch that with a 10 foot pole anymore ;-)

@landmanbester
Copy link
Collaborator Author

I have a candidate at https://gitlab.mpcdf.mpg.de/mtr/ducc/-/tree/roll_experiments; the function is called ducc0.misc.roll_resize_roll. If it's not too hard to get this branch into your environment, please give it a try!

I managed to replace my clunky padding and shifting functions with roll_resize_roll for even inputs. For the record, when the array sizes are even, pad_and_shift(xin, xout) is equivalent to roll_resize_roll(xin, xout, (0, 0), (nxo-nxi//2, nyo-nyi//2) with xin.shape = (nxi, nyi) and xout.shape = (nxo, nyo). unpad_and_unshift(xout, xin) (after a small bug fix) is equivalent to roll_resize_roll(xout, xin, (-(nxo-nxi//2), -(nyo-nyi//2)), (0, 0)). I am still trying to figure out what to do for odd sized arrays (note my numba versions only handled even sized arrays). I'll get to profiling these once I have a working solution. Thanks @mreineck

@landmanbester
Copy link
Collaborator Author

@bennahugo are you really running numpy 1.13 on that machine? I wouldn't touch that with a 10 foot pole anymore ;-)

I'm on numpy 1.22.0

@mreineck
Copy link

@landmanbester, I don't think the amount of shift should depend on the differences between the sizes, but only on the size of the array where you do the shifting.

When you call pad_and_shift, where is the "real image center" of the input array, and where should it be after pad_and_shift?

@mreineck
Copy link

Looking at the psf_convolve_* functions, actually it should not matter at all how much you roll the padded images, as long as you roll them back by the same amount after the convolution. If that is correct, then you only need to worry about the correct amount of rolling before padding and after unpadding.

@landmanbester
Copy link
Collaborator Author

Ah, I have been really silly. I think the shift operation is not even needed here, only the padding. The only thing that needs the shift is the convolution kernel because I need to get the orientation of the PSF with respect to the dirty image right. The following seems to do exactly the same as the previous version of the function

def psf_convolve_cube(xpad,    # preallocated array to store padded image
                      xhat,    # preallocated array to store FTd image
                      xout,    # preallocated array to store output image
                      psfhat,
                      lastsize,
                      x,       # input image, not overwritten
                      nthreads=1):
    _, nx, ny = x.shape
    xpad[...] = 0.0
    xpad[:, 0:nx, 0:ny] = x
    r2c(xpad, axes=(1, 2), nthreads=nthreads,
        forward=True, inorm=0, out=xhat)
    xhat *= psfhat
    c2r(xhat, axes=(1, 2), forward=False, out=xpad,
        lastsize=lastsize, inorm=2, nthreads=nthreads,
        allow_overwriting_input=True)
    xout[...] = xpad[:, 0:nx, 0:ny]
    return xout

only now psf_convolve_cube is totally dominated by the FFT's (the two blocks under psf_convolve_cube are for r2c and c2r)

image

I still don't fully understand where the additional 16.74 - 6.23 - 6.30 = 4.21 seconds are going but my guess is mem_copies. Not sure how to drill down deeper into that but maybe this is deep enough (although suggestions welcome). Thanks for all the input @mreineck. I'll still need the roll_resize_roll function when I need to keep track of what things look like in Fourier space

@mreineck
Copy link

Great, that looks much better!

I still don't fully understand where the additional 16.74 - 6.23 - 6.30 = 4.21 seconds are going but my guess is mem_copies.

Copies, initialization of xpad and the multiplication with psfhat. That touches a lot of memory, and it's (I think) not parallelized. Still, even if that was optimized even more, you won't be able to get more than around 20% speedup, and that's probably not worth it.

@mreineck
Copy link

Hmmm ... you could try to replace

    xpad[...] = 0.0
    xpad[:, 0:nx, 0:ny] = x

by
ducc0.misc,roll_resize_roll(x,xpad,(0,0,0),(0,0,0),nthreads)

and

 xout[...] = xpad[:, 0:nx, 0:ny]

by

ducc0.misc.roll_resize_roll(xpad,xout,(0,0,0),(0,0,0),nthreads)

It might help a bit, but I don't know.

@landmanbester landmanbester changed the title Profiling clean Profiling PSF convolve Jan 10, 2023
@landmanbester
Copy link
Collaborator Author

Yep, I tried that. It makes almost no difference. Pretty sure the psf convolution was the original issue and since that's fairly well optimized now I will close this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants