-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add background thread to update device op queue upon receiving kernel interrupt #189
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1636,6 +1636,9 @@ typedef struct _cl_platform_id | |
// The device operation queue. | ||
// These are the operations that can run immediately on the device. | ||
acl_device_op_queue_t device_op_queue; | ||
// Thread used to update device_op_queue when kernel interrupt triggers | ||
acl_thread_t device_op_queue_update_thread; | ||
bool outstanding_interrupt; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note for my understanding: Reads/writes to this variable are guarded by the global mutex. |
||
|
||
// Limits. See clGetDeviceInfo for semantics. | ||
unsigned int max_param_size; | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -231,6 +231,35 @@ void acl_reset(void) { | |
acl_platform.initialized = 0; | ||
} | ||
|
||
// This function should only be used in the unit test | ||
void acl_reset_join_thread(void) { | ||
{ | ||
std::scoped_lock lock{acl_mutex_wrapper}; | ||
|
||
l_reset_present_board(); | ||
|
||
acl_platform.offline_device = ""; | ||
acl_platform.num_devices = 0; | ||
for (unsigned i = 0; i < ACL_MAX_DEVICE; ++i) { | ||
acl_platform.device[i] = _cl_device_id(); | ||
} | ||
acl_platform.initialized = 0; | ||
Comment on lines
+239
to
+246
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This (almost) duplicates There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I remember seeing some issue if I call |
||
acl_signal_device_update(); | ||
} | ||
// Each unit test test groups are sequentially run and acl_init and acl_reset | ||
// is called once at the start (setup) and end (teardown) of the test group. | ||
// As acl_init wouldn't be called before acl_reset finished, it is okay to | ||
// block here to wait for the device op queue update thread to finish here. | ||
|
||
// Note that the join has to be called without holding the acl global lock, if | ||
// reset acquires lock and wait, the device op queue update thread will try to | ||
// obtain the lock forever, resulting in deadlock in the unit test. | ||
if (acl_platform.device_op_queue_update_thread != 0) { | ||
acl_thread_join(&acl_platform.device_op_queue_update_thread); | ||
acl_platform.device_op_queue_update_thread = 0; | ||
} | ||
} | ||
|
||
//////////////////////////////////////////////////// | ||
// Static functions | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,7 @@ static void l_initialize_devices(const acl_system_def_t *present_board_def, | |
int offline_mode, unsigned int num_devices, | ||
const cl_device_id *devices); | ||
static void l_add_device(int idx); | ||
void *l_eagerly_update_device_op_queue(void *arg); | ||
|
||
////////////////////////////// | ||
// OpenCL API | ||
|
@@ -412,6 +413,10 @@ void acl_init_platform(void) { | |
|
||
// Device operation queue. | ||
acl_init_device_op_queue(&acl_platform.device_op_queue); | ||
// Send off device_op_queue update thread | ||
acl_platform.outstanding_interrupt = 0; | ||
acl_thread_create(&acl_platform.device_op_queue_update_thread, 0, | ||
l_eagerly_update_device_op_queue, NULL); | ||
|
||
// Initialize sampler allocator. | ||
for (int i = 0; i < ACL_MAX_SAMPLER; i++) { | ||
|
@@ -737,6 +742,25 @@ static void l_add_device(int idx) { | |
device->address_bits = 64; // Yes, our devices are 64-bit. | ||
} | ||
|
||
void *l_eagerly_update_device_op_queue(void *arg) { | ||
while (true) { | ||
std::scoped_lock lock{acl_mutex_wrapper}; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am confused how this is not blocking runtime progress. If the mutex is acquired first and the wait for a device update is second, won't the mutex be locked most of the time, such that other threads trying to acquire the mutex cannot make progress? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When the thread waits it releases the lock so that other threads can make progress, are you referring to the time between the lock and wait during which other threads will be blocked? |
||
|
||
// Sleep if no interrupt happening | ||
acl_wait_for_device_update(NULL); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The side effect of using this existing call is that the thread gets woken up not just when there is a mmd triggered kernel interrupt, but also when all the signal handler functions are called ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can potentially
To signal to the device op queue update thread, to avoid waking up from event status update calls, etc. However, the device op queue update thread might still need to acquire the global lock when doing the update, so that no other thread is modifying the shared resources (which are quite a lot...). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are good questions! However, instead of considering the options for the choice of syncronisation mechanism, I suggest taking a step back and considering the data structures that are underlying this change and how they might be refactored to avoid (too much) locking due to concurrent access. The primary goal here is to make forward progress in processing the device operations queue, correct? With this implementation, the queue is written to by multiple threads and read from by multiple threads (some spawned by the user and the background thread spawned by the runtime). Is there any advantage in processing the queue from multiple threads? To avoid contention over the global mutex, could the processing, i.e. reading from the queue be taken over by the background thread only? Such that the user's (potentially) multiple threads only feed, i.e. write to the queue? If the queue is indeed the central data structure to this change, then the task becomes to find a thread-safe data structure. In the simplest implementation this could be a achieved using a dedicated mutex. Beyond that, third-party implementations are available, some of them lock-free, with different tradeoffs to be evaluated. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the feedback! Yes, the primary goal is to make forward progress in processing the device operation queue. In the model you suggested, is it kind of like treating the device op queue as a buffer for consumer (runtime spawned thread) and producer (user threads)? This conceptually makes sense, just that I'm not sure if there are other runtime constructs (events, command queue, etc.) modified during the process of updating the device op queue, would that pose any obstacle to the consumer-producer model? Also I'm thinking, if at the end of the day we are actively making progress on the device op queue, maybe it would be possible to remove some of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, I am thinking of a multi-producer single-consumer queue. The runtime is thread-safe, but it is not efficient since it was designed for a single thread only. The addition of a background thread means a significant overhaul of this design.
Indeed, that is the next question. As the background thread consumes new operations from the queue, can it perform its work without constantly obtaining the global mutex for other resources, thus blocking the foreground/user threads feeding the queue? What resources does the background thread need to perform its work? Which of these are shared with foreground threads and how frequently?
I am not certain either, but removing these is likely needed to avoid blocking the background thread. Maybe a good start is to draft a summary of all the tasks which the runtime currently performs in the "background" to make forward progress? |
||
|
||
if (!acl_platform.initialized) { | ||
break; | ||
} | ||
if (acl_platform.outstanding_interrupt) { | ||
acl_print_debug_msg("Serving outstanding kernel interrupt...\n"); | ||
acl_update_device_op_queue(&(acl_platform.device_op_queue)); | ||
acl_platform.outstanding_interrupt = 0; | ||
} | ||
} | ||
return NULL; | ||
} | ||
|
||
// These functions check to see if a given object is known to the system. | ||
// acl_*_is_valid( * ); | ||
// This is simple because everything is statically allocated. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
#include <acl_context.h> | ||
#include <acl_hal.h> | ||
#include <acl_thread.h> | ||
#include <acl_util.h> | ||
|
||
ACL_TLS int acl_global_lock_count = 0; | ||
ACL_TLS int acl_inside_sig_flag = 0; | ||
|
@@ -55,7 +56,7 @@ void acl_mutex_wrapper_t::resume_lock(int lock_count) { | |
|
||
void acl_wait_for_device_update(cl_context context) { | ||
acl_assert_locked(); | ||
if (acl_get_hal()->get_debug_verbosity && | ||
if (acl_context_is_valid(context) && acl_get_hal()->get_debug_verbosity && | ||
acl_get_hal()->get_debug_verbosity() > 0) { | ||
unsigned timeout = 5; // Seconds | ||
// Keep waiting until signal is received | ||
|
@@ -102,6 +103,14 @@ __attribute__((constructor)) static void l_global_lock_init() { | |
} | ||
|
||
__attribute__((destructor)) static void l_global_lock_uninit() { | ||
if (acl_get_platform()->device_op_queue_update_thread) { | ||
{ | ||
std::scoped_lock lock{acl_mutex_wrapper}; | ||
acl_get_platform()->initialized = 0; | ||
acl_signal_condvar(&l_acl_global_condvar); // wake up waiting thread | ||
} | ||
acl_thread_join(&acl_get_platform()->device_op_queue_update_thread); | ||
} | ||
Comment on lines
+106
to
+113
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is the thread joined in the library destructor? From my naïve understanding, I would expect the thread to be created on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I originally included the thread as a member of |
||
acl_reset_condvar(&l_acl_global_condvar); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is used in the unit tests only it should not be in the
include
orsrc
directories, but undertest
.