Creating and Accessing Buffers
The user only needs to handle host-side buffers to configure kernel I/O and host-side interaction. All host parameters are aligned to 4096 bytes and padded to a multiple of 4096 bytes to maintain high performance during copies and zero-copy accesses from devices.
If a host parameter's number of elements is not an integer multiple of the work-group size (such as 256), then it can only be used as a random-access buffer, and all of its elements must be copied to each device before kernel execution.
When a host parameter's element count is an integer multiple of the work-group size, it can be used as input/output of the "embarrassingly parallel" type, where every work item (i.e. GPU thread / CPU SIMD lane) reads only its own input element and writes only its own result element.
For example, to add 1 to all elements of a 1024-element array, the kernel needs no random-access behavior, so each device (GPU, accelerator, CPU) can copy only its own region of the input/output to save bandwidth:
```cpp
// create parameters of kernel (also allocated in each device)
bool isAinput = true;
bool isBinput = false;
bool isAoutput = false;
bool isBoutput = true;
bool isInputRandomAccess = false;
int dataElementsPerThread = 1;
GPGPU::HostParameter a = computer.createHostParameter<int>("a", 1024, dataElementsPerThread, isAinput, isAoutput, isInputRandomAccess);
GPGPU::HostParameter b = computer.createHostParameter<int>("b", 1024, dataElementsPerThread, isBinput, isBoutput, isInputRandomAccess);
```
In this example, each thread is expected to read only one element of a (using its id value as the index) and write its result to b in the same way. If the work-group size (local threads) is 256, then at most 4 devices can share the workload; more devices cannot be given work, because the minimum amount of work per device is one work-group ("local threads"). When the number of global threads (n = 1024) is high enough, devices trade work-groups to balance the total workload and minimize the kernel's running time.
Output arrays are always non-random-access, and all devices' outputs are combined into the relevant host parameter without collisions.
Kernel code is expected not to go out of bounds of its own work-group's elements, so it is only safe to access elements belonging to neighboring threads in the same local group. For example, when local threads = 256, the work item with id = 5 can also read the input of id = 255.
When kernel parameters need to be changed, setting them one by one takes too many lines of code. Instead, method chaining can be used inside the computer.compute(...) method:
```cpp
GPGPU::HostParameter a = computer.createHostParameter<float>(...);
GPGPU::HostParameter b = computer.createHostParameter<float>(...);
GPGPU::HostParameter c = computer.createHostParameter<float>(...);
GPGPU::HostParameter list = a.next(b).next(c);
// kernel takes 3 parameters in the same order:
// kernel void kernel_name(global float * a, global float * b, global float * c) { ... }
computer.compute(list, "kernel_name", 0, 1000, 100);
```