Chi sqrd gradient #147
base: master
Conversation
@marziarivi I'm keen to merge this into master. I haven't had a chance to look at it yet, but I aim to do this before master significantly diverges again.
Great! Following your scheme, I am now trying to add it to version v5 as well in order to use many GPUs for my simulations. On 20 Jul 2016, at 13:38, Simon Perkins [email protected] wrote:
To implement the multi-GPU gradient, I need to understand a few things: …
Do you agree? On 20 Jul 2016, at 14:34, Marzia Rivi [email protected] wrote:
You're correct, it only copies the related part. _thread_budget takes the array configuration in config.py (where you defined X2_grad) and reduces dimensions until all arrays can fit within the GPU for a given memory budget. Each chunk is then enqueued for execution on a thread associated with each device, and the dimension slices are passed through to this thread. For each chunk, we iterate over source batches, further slicing over the source dimensions. Then, all arrays required for a specific kernel are copied into pinned memory and an asynchronous memory copy to the GPU is scheduled on a CUDA stream. Directly after this, the CUDA kernel is enqueued on the same CUDA stream. See here for instance. When enqueueing arrays, the given CPU (and GPU) dimension slices are used to create views of the CPU and GPU arrays to copy to/from.
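A minimal sketch of that enqueue path, assuming PyCUDA; the chunk slicing, the enqueue_chunk helper and the trivial scale kernel are illustrative placeholders, not Montblanc's actual code:

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# A trivial stand-in kernel; the real gradient kernel is far more involved.
mod = SourceModule("""
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
""")
scale = mod.get_function("scale")

def enqueue_chunk(cpu_array, gpu_buf, cpu_slice, stream):
    """Copy one chunk of cpu_array to the GPU asynchronously and enqueue a
    kernel on the same stream. cpu_slice plays the role of the dimension
    slices passed to the per-device thread."""
    chunk = cpu_array[cpu_slice]
    # Stage through pinned (page-locked) memory so the copy can be asynchronous.
    pinned = drv.pagelocked_empty(chunk.shape, chunk.dtype)
    pinned[:] = chunk
    drv.memcpy_htod_async(gpu_buf, pinned, stream)
    # The kernel is enqueued directly after the copy, on the same stream,
    # so it only runs once the copy has completed.
    scale(gpu_buf, np.int32(chunk.size), block=(256, 1, 1),
          grid=((chunk.size + 255) // 256, 1), stream=stream)
    return pinned        # keep a reference alive until the stream syncs

# Usage: one stream per device thread, iterating over chunk slices.
data = np.arange(1024, dtype=np.float32)
gpu_buf = drv.mem_alloc(data[:256].nbytes)
stream = drv.Stream()
pinned_bufs = [enqueue_chunk(data, gpu_buf, slice(s, s + 256), stream)
               for s in range(0, data.size, 256)]
stream.synchronize()
```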
I see you've classified … Model visibilities are accumulated for each separate source batch on the GPU. They are zeroed on the first batch (this zeroing is skipped if VISIBILITY_ACCUMULATION is configured) and only after visibilities from the last batch have been accumulated is model_vis on the GPU copied back onto the CPU.
Array data for each associated visibility chunk and source batch (and hence …)
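A hedged sketch of that accumulation pattern; the names accumulate_model_vis, source_batches and accumulate_batch are placeholders, not Montblanc's actual API:

```python
import numpy as np

def accumulate_model_vis(model_vis_cpu, source_batches, accumulate_batch,
                         zero_first_batch=True):
    """Accumulate model visibilities over the source batches of one chunk.

    model_vis_cpu    : complex visibility array on the host.
    source_batches   : iterable of source-slice descriptions.
    accumulate_batch : callable adding one batch's visibilities in place
                       (stands in for the GPU kernel launch).
    zero_first_batch : mirrors the behaviour described above; set False
                       when visibility accumulation is configured.
    """
    model_vis_gpu = np.empty_like(model_vis_cpu)    # stand-in for the GPU copy

    for i, batch in enumerate(source_batches):
        if i == 0 and zero_first_batch:
            model_vis_gpu[:] = 0                    # zeroed on the first batch
        accumulate_batch(model_vis_gpu, batch)      # += this batch's term

    # Only after the last batch is the result copied back onto the CPU.
    model_vis_cpu[:] = model_vis_gpu
    return model_vis_cpu
```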
On 20 Jul 2016, at 17:12, Simon Perkins [email protected] wrote:
Therefore the X2_grad must be created (without slicing) and set to zero on each GPU before calling the gradient kernel.
I've had a brief look at your code. I think what will happen is that an X2_grad of shape (3, nssrc) will be generated per chunk. These X2_grads will then need to be added together to produce a final X2_grad. The way this works in the Compose Solver is that chunks are submitted to the thread associated with each GPU via a promises and futures framework. The future currently returns the chunk's X2.
Asynchronous operations (CUDA memory copies, CUDA kernels) are submitted to a CUDA stream and are executed in a non-blocking manner. Note that the chunk is enqueued on a device by submitting an enqueue operation on the thread associated with each GPU. Each thread is handled using a promises and futures framework. This enqueue operation returns a (possibly not yet completed) future.
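A minimal sketch of that pattern using Python's concurrent.futures; the device ids, chunk contents and enqueue_chunk function are illustrative, not Montblanc's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor

def enqueue_chunk(device_id, chunk):
    """Stand-in for the real enqueue: issue async copies and kernels for
    this chunk on the device's CUDA stream and return the chunk's result."""
    return sum(chunk)          # placeholder for the chunk's X2 contribution

# One single-threaded executor per GPU, so work for a device is serialised
# on the thread that owns its CUDA context.
device_threads = {dev: ThreadPoolExecutor(max_workers=1) for dev in (0, 1)}

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Submitting a chunk returns a (possibly not yet completed) future.
futures = [device_threads[i % 2].submit(enqueue_chunk, i % 2, chunk)
           for i, chunk in enumerate(chunks)]

# The per-chunk results are reduced once the futures resolve.
total_X2 = sum(f.result() for f in futures)
print(total_X2)
```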
I disagree. I think you can zero your X2_grad per chunk and then sum the per-chunk results as they are computed. This is what I've done with the X2 -- each chunk's X2 contributes to the total. Finally, I notice that you've used atomics to compute the reduction for your X2_grad, rather than outputting each term individually. For the X2, for example, Montblanc has an (ntime, nbl, nchan) array of per-visibility chi-squared terms that is reduced afterwards.
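A small sketch of that per-chunk reduction; compute_chunk_grad is a placeholder for whatever the gradient kernel produces for one chunk:

```python
import numpy as np

nssrc = 5                     # number of Sersic sources, for illustration

def compute_chunk_grad(chunk):
    """Placeholder: return this chunk's (3, nssrc) gradient contribution,
    starting from a freshly zeroed array as suggested above."""
    X2_grad = np.zeros((3, nssrc))
    X2_grad += chunk           # stand-in for the accumulated kernel output
    return X2_grad

chunks = [np.full((3, nssrc), c) for c in (0.1, 0.2, 0.3)]

# Each chunk produces its own X2_grad; the final gradient is their sum,
# exactly as the per-chunk X2 values are summed into the total X2.
X2_grad_total = sum(compute_chunk_grad(c) for c in chunks)
print(X2_grad_total[:, 0])    # -> [0.6 0.6 0.6]
```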
On 21 Jul 2016, at 17:58, Simon Perkins [email protected] wrote:
for src_cpu_slice_map, src_gpu_slice_map in self._gen_source_slices(): … then add the X2_grad reduction as for the X2.
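A sketch of how that loop might accumulate the gradient per source slice, assuming _gen_source_slices yields (src_cpu_slice_map, src_gpu_slice_map) pairs as described earlier; the gradient_kernel call and the toy slice maps are placeholders:

```python
import numpy as np

def chunk_X2_grad(gen_source_slices, gradient_kernel, nssrc):
    """Accumulate the gradient over the source slices of one chunk.

    gen_source_slices stands in for self._gen_source_slices(), yielding
    (src_cpu_slice_map, src_gpu_slice_map) pairs; gradient_kernel is a
    placeholder for the CUDA kernel launch for one source batch.
    """
    X2_grad = np.zeros((3, nssrc))     # zeroed once per chunk

    for src_cpu_slice_map, src_gpu_slice_map in gen_source_slices:
        # Each source batch adds its contribution; per-chunk results are
        # then reduced across chunks exactly as for the X2.
        X2_grad += gradient_kernel(src_cpu_slice_map, src_gpu_slice_map)

    return X2_grad

# Toy usage with dummy slice maps and a dummy kernel.
slices = [({"nsrc": slice(0, 2)}, {"nsrc": slice(0, 2)}),
          ({"nsrc": slice(2, 4)}, {"nsrc": slice(2, 4)})]
grad = chunk_X2_grad(slices, lambda c, g: np.ones((3, 4)), nssrc=4)
print(grad[0])    # -> [2. 2. 2. 2.]
```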
A solution that would avoid serialising atomic operations on device memory and reduce atomics in shared memory is to split sources among blocks, so that each block accesses a different part of X2_grad and processes a smaller number of sources at a time. But this would require a different grid structure for this kernel, e.g. blockDim.z = sources_per_block and looping over the channel slices instead of over the sources (if possible). What do you think? Maybe chatting via Skype would be easier for further explanation…
Can one of the admins verify this patch?
CPU and GPU chi-squared gradient implemented for the v4 version. Branch updated with the current master. An example is included that compares the analytical gradient with a numerical gradient computed as the incremental ratio (finite difference) of the chi squared.
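A hedged sketch of that kind of check, with compute_X2 standing in for a chi-squared evaluation as a function of the source parameters (not the actual example shipped in the branch):

```python
import numpy as np

def compute_X2(params):
    """Placeholder chi-squared; in practice this would run the solver
    with the given source parameters and return the X2 value."""
    return np.sum((params - 1.0) ** 2)

def numerical_grad(f, params, eps=1e-6):
    """Central finite-difference (incremental ratio) approximation."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2.0 * eps)
    return grad

params = np.array([0.5, 2.0, -1.0])
analytical = 2.0 * (params - 1.0)          # known gradient of the placeholder
numerical = numerical_grad(compute_X2, params)
print(np.allclose(analytical, numerical, atol=1e-5))   # -> True
```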