
Chi sqrd gradient #147

Open
wants to merge 34 commits into master

Conversation

marziarivi
Contributor

CPU and GPU chi-squared gradient implemented for the v4 version. The branch has been updated with the current master. An example is included that compares the analytical gradient with a numerical gradient computed as the chi-squared incremental ratio.
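For reference, a gradient check of this kind can be sketched in a few lines of numpy; `chi_squared` and `analytic_gradient` below are hypothetical stand-ins for the solver's actual functions, not Montblanc's API:

```python
import numpy as np

def chi_squared(params):
    # Hypothetical stand-in for the solver's chi-squared evaluation
    return np.sum((params - 3.0) ** 2)

def analytic_gradient(params):
    # Hypothetical stand-in for the analytical (CPU/GPU) gradient
    return 2.0 * (params - 3.0)

def numerical_gradient(f, params, eps=1.0e-6):
    # Chi-squared incremental ratio (forward difference) per parameter
    grad = np.zeros_like(params)
    f0 = f(params)
    for i in range(params.size):
        p = params.copy()
        p[i] += eps
        grad[i] = (f(p) - f0) / eps
    return grad

params = np.array([1.0, 2.5, 4.0])
print(np.allclose(analytic_gradient(params),
                  numerical_gradient(chi_squared, params), atol=1.0e-4))
```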

@sjperkins
Member

@marziarivi I'm keen to merge this into master. I haven't had a chance to look at it yet, but I aim to do this before master significantly diverges again.

@marziarivi
Contributor Author

Great! Following your scheme, I am now trying to add it to version v5 as well in order to use many GPUs for my simulations.


@marziarivi
Contributor Author

To implement the multi-GPU gradient, I need to understand a few things:

  1. Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.
  2. It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?
  3. I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.

Do you agree?


@sjperkins
Member

  1. Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.

You're correct, it only copies the related part. _thread_budget takes the array configuration in config.py (where you defined X2_grad) and reduces dimensions until all arrays can fit within the GPU for a given memory budget.

Then, the _gen_vis_slices method iterates over chunks of the ntime, nbl and nchan dimensions, creating slices for each dimension on each iteration.

Each chunk is enqueued for execution on a thread associated with each device. The dimension slices are passed through to this thread. For each chunk, we iterate over source batches, further slicing over the source dimensions (nsrc, ngsrc, nssrc).

Then, all arrays required for a specific kernel are copied into pinned memory and an async memory copy to the GPU is scheduled on a CUDA stream. Directly after this, the CUDA kernel is enqueued on the CUDA stream. See here for instance.

When enqueueing arrays, the given CPU (and GPU) dimension slices are used to create views of the CPU and GPU arrays to copy to/from.
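As a rough illustration only (a toy kernel and hypothetical names, not Montblanc's enqueue code), the PyCUDA pattern of a pinned-memory copy plus kernel launch on a stream looks like this:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Toy kernel standing in for one of the RIME kernels
mod = SourceModule("""
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

stream = drv.Stream()
n = 1 << 20

# Pinned (page-locked) host memory allows genuinely asynchronous copies
host_chunk = drv.pagelocked_empty(n, dtype=np.float32)
host_chunk[:] = np.random.rand(n).astype(np.float32)
dev_chunk = drv.mem_alloc(host_chunk.nbytes)

# Enqueue the async host-to-device copy, then the kernel, on the same stream;
# stream ordering guarantees the copy completes before the kernel runs
drv.memcpy_htod_async(dev_chunk, host_chunk, stream)
scale(dev_chunk, np.int32(n),
      block=(256, 1, 1), grid=(n // 256, 1), stream=stream)
drv.memcpy_dtoh_async(host_chunk, dev_chunk, stream)

stream.synchronize()
```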

  2. It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?

I see you've classified X2_grad as GPU_SCRATCH, which is fine if it is an output array similar to model_vis. I'd imagine you'll have to zero X2_grad on the first source batch, like I do for the model visibilities here.

Model visibilities are accumulated for each separate source batch on the GPU. They are zeroed on the first batch (this is skipped if VISIBILITY_ACCUMULATION is configured) and only after visibilities from the last batch have been accumulated is model_vis on the GPU copied back onto the CPU.
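Sketched with hypothetical names (X2_grad_gpu, source_batches), the zero-on-first-batch pattern described above looks roughly like this:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

nssrc = 100
# Hypothetical GPU-side accumulator shaped like X2_grad: 3 parameters per Sersic source
X2_grad_gpu = gpuarray.empty((3, nssrc), dtype=np.float64)

# Hypothetical batching of the source dimension
source_batches = [slice(i, i + 25) for i in range(0, nssrc, 25)]

for batch, src_slice in enumerate(source_batches):
    if batch == 0:
        # Zero the accumulator once, before the first source batch
        X2_grad_gpu.fill(0.0)
    # ... enqueue the gradient kernel for this source slice here;
    # it accumulates its contribution into X2_grad_gpu ...

# Only after the last batch is the accumulator copied back to the CPU
X2_grad = X2_grad_gpu.get()
```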

  3. I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.

Array data for each associated visibility chunk and source batch (and hence model_vis and observed_vis) is always on the GPU, and the GPU kernels have this information to reason about where they are in the global problem space. See here for instance.

@marziarivi
Contributor Author

On 20 Jul 2016, at 17:12, Simon Perkins wrote:

Looking at the CompositeSolver, it is not clear to me whether you copy the full solver arrays to each GPU or only the related part. I see you compute how to split the extent of each array, but then it seems to me that each GPU holds a copy of all of it. I am pretty sure I am missing something, because the reason to use more GPUs is to split the data among them when the memory required is higher than what is available on a single GPU.
You're correct, it only copies the related part. _thread_budget takes the array configuration in config.py (where you defined X2_grad) and reduces dimensions until all arrays can fit within the GPU for a given memory budget.

Then, the _gen_vis_slices method iterates over chunks of the ntime, nbl and nchan dimensions, creating slices for each dimension on each iteration.

Each chunk is enqueued for execution on a thread associated with each device.

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)
The dimension slices are passed through to this thread. For each chunk, we iterate over source batches, further slicing over the source dimensions (nsrc, ngsrc, nssrc). Then, all arrays required for a specific kernel are copied into pinned memory and an async memory copy to the GPU is scheduled on a CUDA stream. Directly after this, the CUDA kernel is enqueued on the CUDA stream. See here for instance.

Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices?
So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices,
for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():
after yours, in which to call the gradient kernel.
When enqueueing arrays, the given CPU (and GPU) dimension slices are used to create views of the CPU and GPU arrays to copy to/from.

It seems you treat every solver array, so the X2_grad (3, nssrc) array I added should also be handled and I should take care of it. However, I must initialise it to zero; what is the best way to do this? Maybe I can do this in the create_arrays function called from the CompositeSolver?
I see you've classified X2_grad as GPU_SCRATCH, which is fine if it is an output array similar to model_vis. I'd imagine you'll have to zero X2_grad on the first source batch, like I do for the model visibilities here.

Model visibilities are accumulated for each separate source batch on the GPU. They are zeroed on the first batch (this is skipped if VISIBILITY_ACCUMULATION is configured) and only after visibilities from the last batch have been accumulated is model_vis on the GPU copied back onto the CPU.

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles.

Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel.
What is the best way to do this?
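One possible answer, offered only as a hedged sketch: allocate X2_grad with pycuda.gpuarray and enqueue the zero-fill on the same CUDA stream the gradient kernel will use, so the fill is ordered before the kernel without extra synchronisation. The names here are illustrative, not the solver's:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()
nssrc = 100

# Hypothetical GPU-side gradient accumulator, created zeroed
X2_grad_gpu = gpuarray.zeros((3, nssrc), dtype=np.float64)

# For later chunks, re-zero it on the gradient kernel's stream: the fill is
# enqueued before the kernel, so stream ordering guarantees it completes first
X2_grad_gpu.fill(0.0, stream=stream)

# ... launch the SersicChiSquaredGradient kernel on `stream` here ...

stream.synchronize()
```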

I can create a Python wrapper for the SersicChiSquaredGradient as you did for the other kernels and then call it in the solve function after self.rime_reduce.execute(self). I just need to be sure that observed_vis and the visibilities are still in GPU memory when the gradient kernel runs.
Array data for each associated visibility chunk and source batch (and hence model_vis and observed_vis) is always on the GPU, and the GPU kernels have this information to reason about where they are in the global problem space. See here for instance.



@sjperkins
Member

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)

I've had a brief look at your code. I think what will happen is that an X2_grad (3, nssrc) will be generated per chunk. These X2_grads will then need to be added together to produce a final X2_grad. The way this works in the CompositeSolver is that chunks are submitted to the thread associated with each GPU via a promises and futures framework. The future currently returns an X2 and the model_vis associated with that chunk. These values need to be either accumulated (in the case of the X2) or written out to the appropriate portion of the CPU model_vis array. See here for instance.
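The per-chunk accumulation being suggested, reduced to a generic futures sketch (enqueue_chunk is a hypothetical stand-in that returns a chunk's X2 and X2_grad; it is not the CompositeSolver API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

nssrc, nchunks = 100, 8

def enqueue_chunk(chunk_id):
    # Hypothetical: enqueue copies and kernels for this chunk on its GPU,
    # then return the chunk's X2 scalar and (3, nssrc) X2_grad contribution
    return 1.0, np.full((3, nssrc), 0.5)

X2_total = 0.0
X2_grad_total = np.zeros((3, nssrc))

with ThreadPoolExecutor(max_workers=2) as pool:      # e.g. one thread per GPU
    futures = [pool.submit(enqueue_chunk, c) for c in range(nchunks)]
    for f in futures:
        X2_chunk, X2_grad_chunk = f.result()
        X2_total += X2_chunk               # accumulate the scalar chi-squared
        X2_grad_total += X2_grad_chunk     # accumulate the gradient per chunk
```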

Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices? So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices (for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():) after yours, in which to call the gradient kernel.

Asynchronous operations (CUDA memory copies, CUDA kernels) are submitted to a CUDA stream and are executed in a non-blocking manner. Note that the chunk is enqueued on a device by submitting an enqueue operation on the thread associated with each GPU. Each thread is handled using a promises and futures framework. This enqueue operation returns a (possibly not completed) enqueue_future which, in turn, is submitted to a synchronisation thread here. _sync_wait then blocks waiting for the result of the enqueue_future, and finally itself returns a future which the solve() method (occasionally) blocks on here and here.
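Stripped of the CUDA details, the double-future flow described above can be sketched with the standard library (the names enqueue and sync_wait are illustrative; the real _sync_wait also waits on the CUDA stream):

```python
from concurrent.futures import ThreadPoolExecutor

gpu_pool = ThreadPoolExecutor(max_workers=1)   # thread associated with one GPU
sync_pool = ThreadPoolExecutor(max_workers=1)  # synchronisation thread

def enqueue(chunk):
    # Hypothetical: submit async copies and kernels for `chunk` to the GPU's
    # stream and return once they have been enqueued (not necessarily finished)
    return "chunk %d enqueued" % chunk

def sync_wait(enqueue_future):
    # Block until the enqueue future resolves (the real code also
    # synchronises the CUDA stream here) and pass the result on
    return enqueue_future.result()

# solve()-style flow: submit the enqueue, hand its future to the sync thread,
# and occasionally block on the future that the sync thread returns
enqueue_future = gpu_pool.submit(enqueue, 0)
sync_future = sync_pool.submit(sync_wait, enqueue_future)
print(sync_future.result())

gpu_pool.shutdown()
sync_pool.shutdown()
```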

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles. Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel. What is the best way to do this?

I disagree. I think you can zero your X2_grad per chunk and then sum them as they are computed per chunk. This is what I've done with the X2 -- each chunk's X2 contributes to the total.

Finally, I notice that you've used atomics to compute the reduction for your X2_grad, compared to outputting each term individually. For the X2, for example, Montblanc has a (ntime, nbl, nchan) chi_sqrd_result array which is summed using a PyCUDA reduction. I wonder how efficient it is to use atomics to perform the reduction, compared to a parallel reduction. This suggests that the atomicAdds will be serialised:

The code you see above is very simple. Each thread reads a value from memory sequentially, which is quite fast. If the sum of all those numbers is needed, you might think it would be okay to simply use an atomicAdd operation. This would effectively calculate the final sum in one line of code, which might seem great. Unfortunately, all of those operations will be serialized, which is extremely slow. If you have 512 threads per thread block, each block would have to do 512 sequential additions. However, using a reduction method discussed in a previous tutorial would be able to accomplish the same task with just 511 additions. The key here is that these additions can be done in parallel. Generally, 16 or 32 additions can be done completely in parallel, making it much faster than using an atomicAdd for a reduction problem such as this. So whenever you use atomic operations, be sure to program such that there won't need to be too many sequential operations. Failure to do so will result in a dramatic loss of parallelism, and thus a dramatic loss in performance. However, when atomic operations are used correctly, they are extremely useful.
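For comparison, the PyCUDA parallel reduction mentioned above can be sketched as follows; a generic sum over a chi-squared term array, not Montblanc's exact kernel:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

# Tree-style parallel sum of chi-squared terms, instead of every thread
# performing an atomicAdd into a single accumulator
sum_kernel = ReductionKernel(np.float64, neutral="0.0",
                             reduce_expr="a+b", map_expr="x[i]",
                             arguments="const double *x")

chi_sqrd_result = gpuarray.to_gpu(np.random.rand(1 << 16))
X2 = sum_kernel(chi_sqrd_result).get()   # scalar result copied to the host
print(X2, chi_sqrd_result.get().sum())   # the two sums should agree
```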

@marziarivi
Contributor Author

On 21 Jul 2016, at 17:58, Simon Perkins wrote:

Ok, so you split arrays among GPUs by slicing only along ntime, nbl and nchan, whereas each GPU processes all the sources. Then my X2_grad (3, nssrc) wouldn't be split among GPUs, would it? (Which is fine!)

I've had a brief look at your code. I think what will happen is that an X2_grad (3, nssrc) will be generated per chunk. These X2_grads will then need to be added together to produce a final X2_grad. The way this works in the CompositeSolver is that chunks are submitted to the thread associated with each GPU via a promises and futures framework. The future currently returns an X2 and the model_vis associated with that chunk. These values need to be either accumulated (in the case of the X2) or written out to the appropriate portion of the CPU model_vis array. See here for instance.

Ok, I agree.
Sorry, I am not familiar with PyCUDA. Are you saying that each CUDA stream refers to the sequence of kernels (E_beam, B_sqrt, ...) for the same source slice, and then the same for the next slices? So I have to wait until all the CUDA streams (i.e. all your kernels for all source slices) are finished before I can call the gradient computation (as it needs the final value of each visibility considered). This means I should add another loop over the source slices (for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():) after yours, in which to call the gradient kernel.

Asynchronous operations (CUDA memory copies, CUDA kernels) are submitted to a CUDA stream and are executed in a non-blocking manner. Note that the chunk is enqueued on a device by submitting an enqueue operation on the thread associated with each GPU. Each thread is handled using a promises and futures framework. This enqueue operation returns a (possibly not completed) enqueue_future which, in turn, is submitted to a synchronisation thread here. _sync_wait then blocks waiting for the result of the enqueue_future, and finally itself returns a future which the solve() method (occasionally) blocks on here and here.

Ok. So I should add another loop here, in order to have the final values for the model visibilities of the current chunk (i.e. after the iteration over all source slices).

for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices():
    # Update our maps with source slice information
    # Configure dimension extents and global size on the sub-solver
    # Enqueue X2 gradient computation

Then add X2_grad reduction as for X2.

My case is different: I cannot set X2_grad to zero within the kernel because CUDA blocks are executed asynchronously and all of them contribute to all the elements of X2_grad related to the current source slice. Each block computes, for each Sersic source, the contribution given by its slice of uv points, and at the end of its execution accumulates with atomic operations a single contribution to each element of X2_grad related to the current chunk of particles. Therefore X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel. What is the best way to do this?

I disagree. I think you can zero your X2_grad per chunk and then sum them as they are computed per chunk. This is what I've done with the X2 -- each chunk's X2 contributes to the total.

Yes, this is true and must be done in the same way. But for each chunk I cannot zero X2_grad within the kernel, because each CUDA block accumulates into the same array elements at the end of processing each particle, and there is no way to synchronise CUDA blocks at the beginning of the kernel. We should create X2_grad directly in GPU memory, set to zero, before launching the kernel. This is what I did in v4.
Finally, I notice that you've used atomics to compute the reduction for your X2_grad, compared to outputting each term individually. For the X2, for example, Montblanc has a (ntime, nbl, nchan) chi_sqrd_result array which is summed using a PyCUDA reduction. I wonder how efficient it is to use atomics to perform the reduction, compared to a parallel reduction. This suggests that the atomicAdds will be serialised:

The code you see above is very simple. Each thread reads a value from memory sequentially, which is quite fast. If the sum of all those numbers is needed, you might think it would be okay to simply use an atomicAdd operation. This would effectively calculate the final sum in one line of code, which might seem great. Unfortunately, all of those operations will be serialized, which is extremely slow. If you have 512 threads per thread block, each block would have to do 512 sequential additions. However, using a reduction method discussed in a previous tutorial would be able to accomplish the same task with just 511 additions. The key here is that these additions can be done in parallel. Generally, 16 or 32 additions can be done completely in parallel, making it much faster than using an atomicAdd for a reduction problem such as this. So whenever you use atomic operations, be sure to program such that there won't need to be too many sequential operations. Failure to do so will result in a dramatic loss of parallelism, and thus a dramatic loss in performance. However, when atomic operations are used correctly, they are extremely useful.

Yes, I know; this is why I did it hierarchically, so that only one thread per block accesses device memory, and only once, while threads within the block access only shared memory (which is much faster). The reason for my choice is to reduce the amount of memory required. Using your approach would have required allocating an array of size (3, nssrc, nbl, ntime, nchan), which is very large in the case of many sources (as it is for weak lensing). In the v4 version this would have limited the number of sources I could process. In the v5 version we could actually split this array like the others, but I am not sure it would be more performant, because it means writing to device memory for each source and each uv point and doing a lot of GPU-CPU data transfers (for large problems), which are the slowest operations.
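For concreteness, here is a generic PyCUDA sketch of that hierarchical pattern: a shared-memory tree reduction within each block followed by a single atomicAdd per block into device memory. It sums a plain array rather than the actual Sersic gradient terms, and uses single precision only to keep atomicAdd universally available:

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void block_reduce_atomic(const float *terms, float *total, int n)
{
    extern __shared__ float partial[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? terms[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Only one thread per block touches device memory, with one atomic
    if (tid == 0)
        atomicAdd(total, partial[0]);
}
""")
kernel = mod.get_function("block_reduce_atomic")

n, threads = 1 << 16, 256
terms = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
total = gpuarray.zeros(1, dtype=np.float32)

kernel(terms.gpudata, total.gpudata, np.int32(n),
       block=(threads, 1, 1), grid=(n // threads, 1),
       shared=threads * np.dtype(np.float32).itemsize)

print(total.get()[0], terms.get().sum())  # should agree up to float rounding
```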

A solution that would avoid serialising atomic operations in device memory and reduce atomics in shared memory is to split sources among blocks, so that each block would access different parts of X2_grad and process a smaller number of sources at a time. But this would require a different grid structure for this kernel, e.g. blockDim.z = sources_per_block, and a loop over the channel slices instead of over the sources (if possible). What do you think?

Maybe chatting via Skype would be easier for further explanation…

@ratt-priv-ci
Collaborator

Can one of the admins verify this patch?
