
Chi sqrd gradient #147

Open
wants to merge 34 commits into master

Conversation

marziarivi
Contributor

CPU and GPU chi-squared gradient implemented for the v4 version. The branch has been updated with the current master. An example is included that compares the analytical gradient with a numerical gradient computed as the chi-squared incremental ratio.
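For reference, a gradient check of this kind can be sketched in a few lines of numpy; `chi_squared` and `analytic_gradient` below are hypothetical stand-ins for the solver's actual functions, not Montblanc's API:

```python
import numpy as np

def chi_squared(params):
    # Hypothetical stand-in for the solver's chi-squared evaluation
    return np.sum((params - 3.0) ** 2)

def analytic_gradient(params):
    # Hypothetical stand-in for the analytical (CPU/GPU) gradient
    return 2.0 * (params - 3.0)

def numerical_gradient(f, params, eps=1.0e-6):
    # Chi-squared incremental ratio (forward difference) per parameter
    grad = np.zeros_like(params)
    f0 = f(params)
    for i in range(params.size):
        p = params.copy()
        p[i] += eps
        grad[i] = (f(p) - f0) / eps
    return grad

params = np.array([1.0, 2.5, 4.0])
print(np.allclose(analytic_gradient(params),
                  numerical_gradient(chi_squared, params), atol=1.0e-4))
```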

@sjperkins
Member

@marziarivi I'm keen to merge this into master. I haven't had a chance to look at it yet, but I aim to do this before master significantly diverges again.

@marziarivi
Contributor Author

Great! Following your scheme, I am now trying to add it to version v5 as well in order to use many GPUs for my simulations.


@marziarivi
Contributor Author

To implement the multi-GPU gradient, I need to understand a few things:

  1. Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.
  2. It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?
  3. I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.

Do you agree?


@sjperkins
Member

  1. Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.

You're correct, it only copies the related part. _thread_budget takes the array configuration in config.py (where you defined X2_grad) and reduces dimensions until all arrays can fit within the GPU for a given memory budget.

Then, the _gen_vis_slices method iterates over chunks of the ntime, nbl and nchan dimensions, creating slices for each dimension on each iteration.

Each chunk is enqueued for execution on a thread associated with each device. The dimension slices are passed through to this thread. For each chunk, we iterate over source batches, further slicing over the source dimensions (nsrc, ngsrc, nssrc).

Then, all arrays required for a specific kernel are copied into pinned memory and an async memory copy to the GPU is scheduled on a CUDA stream. Directly after this, the CUDA kernel is enqueued on the CUDA stream. See here for instance.

When enqueueing arrays, the given CPU (and GPU) dimension slices are used to create views of the CPU and GPU arrays to copy to/from.
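As a rough illustration only (a toy kernel and hypothetical names, not Montblanc's enqueue code), the PyCUDA pattern of a pinned-memory copy plus kernel launch on a stream looks like this:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Toy kernel standing in for one of the RIME kernels
mod = SourceModule("""
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

stream = drv.Stream()
n = 1 << 20

# Pinned (page-locked) host memory allows genuinely asynchronous copies
host_chunk = drv.pagelocked_empty(n, dtype=np.float32)
host_chunk[:] = np.random.rand(n).astype(np.float32)
dev_chunk = drv.mem_alloc(host_chunk.nbytes)

# Enqueue the async host-to-device copy, then the kernel, on the same stream;
# stream ordering guarantees the copy completes before the kernel runs
drv.memcpy_htod_async(dev_chunk, host_chunk, stream)
scale(dev_chunk, np.int32(n),
      block=(256, 1, 1), grid=(n // 256, 1), stream=stream)
drv.memcpy_dtoh_async(host_chunk, dev_chunk, stream)

stream.synchronize()
```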

  2. It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?

I see you've classified X2_grad as GPU_SCRATCH, which is fine if it is an output array similar to model_vis. I'd imagine you'll have to zero X2_grad on the first source batch, like I do for the model visibilities here.

Model visibilities are accumulated for each separate source batch on the GPU. They are zeroed on the first batch (this is skipped if VISIBILITY_ACCUMULATION is configured) and only after visibilities from the last batch have been accumulated is model_vis on the GPU copied back onto the CPU.
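Sketched with hypothetical names (X2_grad_gpu, source_batches), the zero-on-first-batch pattern described above looks roughly like this:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

nssrc = 100
# Hypothetical GPU-side accumulator shaped like X2_grad: 3 parameters per Sersic source
X2_grad_gpu = gpuarray.empty((3, nssrc), dtype=np.float64)

# Hypothetical batching of the source dimension
source_batches = [slice(i, i + 25) for i in range(0, nssrc, 25)]

for batch, src_slice in enumerate(source_batches):
    if batch == 0:
        # Zero the accumulator once, before the first source batch
        X2_grad_gpu.fill(0.0)
    # ... enqueue the gradient kernel for this source slice here;
    # it accumulates its contribution into X2_grad_gpu ...

# Only after the last batch is the accumulator copied back to the CPU
X2_grad = X2_grad_gpu.get()
```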

  3. I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.

Array data for each associated visibility chunk and source batch (and hence model_vis and observed_vis) is always on the GPU, and the GPU kernels have this information to reason about where they are in the global problem space. See here for instance.

@marziarivi
Contributor Author

On 20 Jul 2016, at 17:12, Simon Perkins wrote:

Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.
You're correct, it only copies the related part. _thread_budget takes the array configuration in config.py (where you defined X2_grad) and reduces dimensions until all arrays can fit within the GPU for a given memory budget.

Then, the _gen_vis_slices method iterates over chunks of the ntime, nbl and nchan dimensions, creating slices for each dimension on each iteration.

Each chunk is enqueued for execution on a thread associated with each device.

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)
The dimension slices are passed through to this thread. For each chunk, we iterate over source batches, further slicing over the source dimensions (nsrc, ngsrc, nssrc). Then, all arrays required for a specific kernel are copied into pinned memory and an async memory copy to the GPU is scheduled on a CUDA stream. Directly after this, the CUDA kernel is enqueued on the CUDA stream. See here for instance.

Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices?
So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices,
for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():
after yours, in which to call the gradient kernel.
When enqueueing arrays, the given CPU (and GPU) dimension slices are used to create views of the CPU and GPU arrays to copy to/from.

It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?
I see you've classified X2_grad as GPU_SCRATCH, which is fine if it is an output array similar to model_vis. I'd imagine you'll have to zero X2_grad on the first source batch, like I do for the model visibilities here.

Model visibilities are accumulated for each separate source batch on the GPU. They are zeroed on the first batch (this is skipped if VISIBILITY_ACCUMULATION is configured) and only after visibilities from the last batch have been accumulated is model_vis on the GPU copied back onto the CPU.

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles.

Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel.
What is the best way to do this?
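One possible answer, offered only as a hedged sketch: allocate X2_grad with pycuda.gpuarray and enqueue the zero-fill on the same CUDA stream the gradient kernel will use, so the fill is ordered before the kernel without extra synchronisation. The names here are illustrative, not the solver's:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()
nssrc = 100

# Hypothetical GPU-side gradient accumulator, created zeroed
X2_grad_gpu = gpuarray.zeros((3, nssrc), dtype=np.float64)

# For later chunks, re-zero it on the gradient kernel's stream: the fill is
# enqueued before the kernel, so stream ordering guarantees it completes first
X2_grad_gpu.fill(0.0, stream=stream)

# ... launch the SersicChiSquaredGradient kernel on `stream` here ...

stream.synchronize()
```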

I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.
Array data for each associated visibility chunk and source batch (and hence model_vis and observed_vis) is always on the GPU, and the GPU kernels have this information to reason about where they are in the global problem space. See here for instance.



@sjperkins
Member

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)

I've had a brief look at your code. I think what will happen is that an X2_grad (3, nssrc) will be generated per chunk. These X2_grads will then need to be added together to produce a final X2_grad. The way this works in the CompositeSolver is that chunks are submitted to the thread associated with each GPU via a promises and futures framework. The future currently returns an X2 and the model_vis associated with that chunk. These values need to be either accumulated (in the case of the X2) or written out to the appropriate portion of the CPU model_vis array. See here for instance.
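The per-chunk accumulation being suggested, reduced to a generic futures sketch (enqueue_chunk is a hypothetical stand-in that returns a chunk's X2 and X2_grad; it is not the CompositeSolver API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

nssrc, nchunks = 100, 8

def enqueue_chunk(chunk_id):
    # Hypothetical: enqueue copies and kernels for this chunk on its GPU,
    # then return the chunk's X2 scalar and (3, nssrc) X2_grad contribution
    return 1.0, np.full((3, nssrc), 0.5)

X2_total = 0.0
X2_grad_total = np.zeros((3, nssrc))

with ThreadPoolExecutor(max_workers=2) as pool:      # e.g. one thread per GPU
    futures = [pool.submit(enqueue_chunk, c) for c in range(nchunks)]
    for f in futures:
        X2_chunk, X2_grad_chunk = f.result()
        X2_total += X2_chunk               # accumulate the scalar chi-squared
        X2_grad_total += X2_grad_chunk     # accumulate the gradient per chunk
```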

Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices? So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices (for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():) after yours, in which to call the gradient kernel.

Asynchronous operations (CUDA memory copies, CUDA kernels) are submitted to a CUDA stream and are executed in a non-blocking manner. Note that the chunk is enqueued on a device by submitting an enqueue operation on the thread associated with each GPU. Each thread is handled using a promises and futures framework. This enqueue operation returns a (possibly not completed) enqueue_future which, in turn, is submitted to a synchronisation thread here. _sync_wait then blocks waiting for the result of the enqueue_future, and finally itself returns a future which the solve() method (occasionally) blocks on here and here.
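Stripped of the CUDA details, the double-future flow described above can be sketched with the standard library (the names enqueue and sync_wait are illustrative; the real _sync_wait also waits on the CUDA stream):

```python
from concurrent.futures import ThreadPoolExecutor

gpu_pool = ThreadPoolExecutor(max_workers=1)   # thread associated with one GPU
sync_pool = ThreadPoolExecutor(max_workers=1)  # synchronisation thread

def enqueue(chunk):
    # Hypothetical: submit async copies and kernels for `chunk` to the GPU's
    # stream and return once they have been enqueued (not necessarily finished)
    return "chunk %d enqueued" % chunk

def sync_wait(enqueue_future):
    # Block until the enqueue future resolves (the real code also
    # synchronises the CUDA stream here) and pass the result on
    return enqueue_future.result()

# solve()-style flow: submit the enqueue, hand its future to the sync thread,
# and occasionally block on the future that the sync thread returns
enqueue_future = gpu_pool.submit(enqueue, 0)
sync_future = sync_pool.submit(sync_wait, enqueue_future)
print(sync_future.result())

gpu_pool.shutdown()
sync_pool.shutdown()
```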

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles. Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel. What is the best way to do this?

I disagree. I think you can zero your X2_grad per chunk and then sum them as they are computed per chunk. This is what I've done with the X2 -- each chunk's X2 contributes to the total.

Finally, I notice that you've used atomics to compute the reduction for your X2_grad, compared to outputting each term individually. For the X2, for example, Montblanc has a (ntime, nbl, nchan) chi_sqrd_result array which is summed using a PyCUDA reduction. I wonder how efficient it is to use atomics to perform the reduction, compared to a parallel reduction. This suggests that the atomicAdds will be serialised:

The code you see above is very simple. Each thread reads a value from memory sequentially, which is quite fast. If the sum of all those numbers is needed, you might think it would be okay to simply use an atomicAdd operation. This would effectively calculate the final sum in one line of code, which might seem great. Unfortunately, all of those operations will be serialized, which is extremely slow. If you have 512 threads per thread block, each block would have to do 512 sequential additions. However, using a reduction method discussed in a previous tutorial would be able to accomplish the same task with just 511 additions. The key here is that these additions can be done in parallel. Generally, 16 or 32 additions can be done completely in parallel, making it much faster than using an atomicAdd for a reduction problem such as this. So whenever you use atomic operations, be sure to program such that there won't need to be too many sequential operations. Failure to do so will result in a dramatic loss of parallelism, and thus a dramatic loss in performance. However, when atomic operations are used correctly, they are extremely useful.
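For comparison, the PyCUDA parallel reduction mentioned above can be sketched as follows; a generic sum over a chi-squared term array, not Montblanc's exact kernel:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

# Tree-style parallel sum of chi-squared terms, instead of every thread
# performing an atomicAdd into a single accumulator
sum_kernel = ReductionKernel(np.float64, neutral="0.0",
                             reduce_expr="a+b", map_expr="x[i]",
                             arguments="const double *x")

chi_sqrd_result = gpuarray.to_gpu(np.random.rand(1 << 16))
X2 = sum_kernel(chi_sqrd_result).get()   # scalar result copied to the host
print(X2, chi_sqrd_result.get().sum())   # the two sums should agree
```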

@marziarivi
Contributor Author

On 21 Jul 2016, at 17:58, Simon Perkins wrote:

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)

I've had a brief look at your code. I think what will happen is that an X2_grad (3, nssrc) will be generated per chunk. These X2_grads will then need to be added together to produce a final X2_grad. The way this works in the CompositeSolver is that chunks are submitted to the thread associated with each GPU via a promises and futures framework. The future currently returns an X2 and the model_vis associated with that chunk. These values need to be either accumulated (in the case of the X2) or written out to the appropriate portion of the CPU model_vis array. See here for instance.

Ok, I agree.
Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices? So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices (for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():) after yours, in which to call the gradient kernel.

Asynchronous operations (CUDA memory copies, CUDA kernels) are submitted to a CUDA stream and are executed in a non-blocking manner. Note that the chunk is enqueued on a device by submitting an enqueue operation on the thread associated with each GPU. Each thread is handled using a promises and futures framework. This enqueue operation returns a (possibly not completed) enqueue_future which, in turn, is submitted to a synchronisation thread here. _sync_wait then blocks waiting for the result of the enqueue_future, and finally itself returns a future which the solve() method (occasionally) blocks on here and here.

Ok. So I should add another loop here, in order to have the final values for the model visibilities of the current chunk (i.e. after the iteration over all source slices).

for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():
    # Update our maps with source slice information
    # Configure dimension extents and global size on the sub-solver
    # Enqueue X2 gradient computation

Then add X2_grad reduction as for X2.

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles. Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel. What is the best way to do this?

I disagree. I think you can zero your X2_grad per chunk and then sum them as they are computed per chunk. This is what I've done with the X2 -- each chunk's X2 contributes to the total.

Yes, this is true and must be done in the same way. But for each chunk I cannot zero X2_grad within the kernel, because each CUDA block accumulates into the same array elements at the end of processing each particle, and there is no way to synchronise CUDA blocks at the beginning of the kernel. We should create X2_grad directly in GPU memory, set to zero, before launching the kernel. This is what I did in v4.
Finally, I notice that you've used atomics to compute the reduction for your X2_grad, compared to outputting each term individually. For the X2, for example, Montblanc has a (ntime, nbl, nchan) chi_sqrd_result array which is summed using a PyCUDA reduction. I wonder how efficient it is to use atomics to perform the reduction, compared to a parallel reduction. This suggests that the atomicAdds will be serialised:

The code you see above is very simple. Each thread reads a value from memory sequentially, which is quite fast. If the sum of all those numbers is needed, you might think it would be okay to simply use an atomicAdd operation. This would effectively calculate the final sum in one line of code, which might seem great. Unfortunately, all of those operations will be serialized, which is extremely slow. If you have 512 threads per thread block, each block would have to do 512 sequential additions. However, using a reduction method discussed in a previous tutorial would be able to accomplish the same task with just 511 additions. The key here is that these additions can be done in parallel. Generally, 16 or 32 additions can be done completely in parallel, making it much faster than using an atomicAdd for a reduction problem such as this. So whenever you use atomic operations, be sure to program such that there won't need to be too many sequential operations. Failure to do so will result in a dramatic loss of parallelism, and thus a dramatic loss in performance. However, when atomic operations are used correctly, they are extremely useful.

Yes, I know; this is why I did it hierarchically, so that only one thread per block accesses device memory, and only once, while threads within the block access only shared memory (which is much faster). The reason for my choice is to reduce the amount of memory required. Using your approach would have required allocating an array of size (3, nssrc, nbl, ntime, nchan), which is very large in the case of many sources (as it is for weak lensing). In the v4 version this would have limited the number of sources I could process. In the v5 version we could actually split this array like the others, but I am not sure it would be more performant, because it means writing to device memory for each source and each uv point and doing a lot of GPU-CPU data transfers (for large problems), which are the slowest operations.
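For concreteness, here is a generic PyCUDA sketch of that hierarchical pattern: a shared-memory tree reduction within each block followed by a single atomicAdd per block into device memory. It sums a plain array rather than the actual Sersic gradient terms, and uses single precision only to keep atomicAdd universally available:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void block_reduce_atomic(const float *terms, float *total, int n)
{
    extern __shared__ float partial[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? terms[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Only one thread per block touches device memory, with one atomic
    if (tid == 0)
        atomicAdd(total, partial[0]);
}
""")
kernel = mod.get_function("block_reduce_atomic")

n, threads = 1 << 16, 256
terms = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
total = gpuarray.zeros(1, dtype=np.float32)

kernel(terms.gpudata, total.gpudata, np.int32(n),
       block=(threads, 1, 1), grid=(n // threads, 1),
       shared=threads * np.dtype(np.float32).itemsize)

print(total.get()[0], terms.get().sum())  # should agree up to float rounding
```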

A solution that would avoid serialising atomic operations in device memory and reduce atomics in shared memory is to split sources among blocks, so that each block would access different parts of X2_grad and process a smaller number of sources at a time. But this would require a different grid structure for this kernel, e.g. blockDim.z = sources_per_block, and a loop over the channel slices instead of over the sources (if possible). What do you think?

Maybe chatting via Skype would be easier for further explanation…

@ratt-priv-ci
Collaborator

Can one of the admins verify this patch?
