Add background thread to update device op queue upon receiving kernel interrupt #189

sophimao · 2022-10-27T13:34:01Z

Currently the runtime only makes update when:

A synchronization API call (e.g., clFinish) is called
An operation is being enqueued (e.g., clEnqueueNDRangeKernel)

This creates potential hang in the runtime. As there is limitation on FPGA that only one instance (launch)
of the same kernel can be running on the board at one time, multiple launches has to be sequential, where
later launches have to wait for the currently running launch to finish. For a kernel launch that has other
launches waiting on it, when non of the two above update calls happens after that kernel launch finishes,
the runtime will not submit waiting launches and thereby result in a hang.

In this change a background thread is added to update the device op queue when there is outstanding
kernel interrupt to be served.

… interrupt

sophimao · 2022-10-27T13:39:18Z

src/acl_thread.cpp

+    acl_get_platform()->initialized = 0;
+    acl_signal_condvar(&l_acl_global_condvar); // wake up waiting thread
+  }
+  acl_thread_join(&acl_get_platform()->device_op_queue_update_thread);


Not sure if this section of code is necessary here? The situation I have in mind is when there is an interrupt coming and waking up the device op queue update thread, and the program is about to unload the library, then the thread will acquire the lock and get preempted, when acl_reset_condvar is called it will see a locked lock, and that might result in some problem. But I guess having an interrupt coming in when the program is about to end is pretty rare?

Based on the unit tests it seems like this section might still be required, even with this joining code some of the CI jobs are failing when exiting the program:

6: TEST(auto_configure, many_ok_forward_compatibility) - 22 ms 6: TEST(auto_configure, simple) - 0 ms 6: 6: OK (224 tests, 224 ran, 138598587 checks, 0 ignored, 0 filtered out, 14787 ms) 6: 6: acl_test: ../lib/acl_threadsupport/src/acl_threadsupport.c:372: acl_release_condvar: Assertion `ret == 0' failed. 6/6 Test #6: acl_test .........................Child aborted***Exception: 21.01 sec

sophimao · 2022-10-27T13:42:40Z

src/acl_platform.cpp

+    std::scoped_lock lock{acl_mutex_wrapper};
+
+    // Sleep if no interrupt happening
+    acl_wait_for_device_update(NULL);


The side effect of using this existing call is that the thread gets woken up not just when there is a mmd triggered kernel interrupt, but also when all the signal handler functions are called (acl_receive_kernel_update, acl_set_execution_status, acl_receive_device_exception, and acl_schedule_printf_buffer_pickup). But this is already better than always polling, I wonder if it is necessary to fine grain this even further (might be risky)?

Can potentially

Introduce a new acl_condvar_s, or

Use a semaphore (is this doable?)

To signal to the device op queue update thread, to avoid waking up from event status update calls, etc.

However, the device op queue update thread might still need to acquire the global lock when doing the update, so that no other thread is modifying the shared resources (which are quite a lot...).

These are good questions! However, instead of considering the options for the choice of syncronisation mechanism, I suggest taking a step back and considering the data structures that are underlying this change and how they might be refactored to avoid (too much) locking due to concurrent access.

The primary goal here is to make forward progress in processing the device operations queue, correct? With this implementation, the queue is written to by multiple threads and read from by multiple threads (some spawned by the user and the background thread spawned by the runtime). Is there any advantage in processing the queue from multiple threads? To avoid contention over the global mutex, could the processing, i.e. reading from the queue be taken over by the background thread only? Such that the user's (potentially) multiple threads only feed, i.e. write to the queue?

If the queue is indeed the central data structure to this change, then the task becomes to find a thread-safe data structure. In the simplest implementation this could be a achieved using a dedicated mutex. Beyond that, third-party implementations are available, some of them lock-free, with different tradeoffs to be evaluated.

Thanks for the feedback! Yes, the primary goal is to make forward progress in processing the device operation queue. In the model you suggested, is it kind of like treating the device op queue as a buffer for consumer (runtime spawned thread) and producer (user threads)? This conceptually makes sense, just that I'm not sure if there are other runtime constructs (events, command queue, etc.) modified during the process of updating the device op queue, would that pose any obstacle to the consumer-producer model?

Also I'm thinking, if at the end of the day we are actively making progress on the device op queue, maybe it would be possible to remove some of the acl_idle_updates calls in clEnqueue... used to nudge the device op queue. I'm not sure if this idea is practical in any sense or if it will cause any issue though...

In the model you suggested, is it kind of like treating the device op queue as a buffer for consumer (runtime spawned thread) and producer (user threads)?

Yes, I am thinking of a multi-producer single-consumer queue. The runtime is thread-safe, but it is not efficient since it was designed for a single thread only. The addition of a background thread means a significant overhaul of this design.

This conceptually makes sense, just that I'm not sure if there are other runtime constructs (events, command queue, etc.) modified during the process of updating the device op queue, would that pose any obstacle to the consumer-producer model?

Indeed, that is the next question. As the background thread consumes new operations from the queue, can it perform its work without constantly obtaining the global mutex for other resources, thus blocking the foreground/user threads feeding the queue? What resources does the background thread need to perform its work? Which of these are shared with foreground threads and how frequently?

Also I'm thinking, if at the end of the day we are actively making progress on the device op queue, maybe it would be possible to remove some of the acl_idle_updates calls in clEnqueue... used to nudge the device op queue. I'm not sure if this idea is practical in any sense or if it will cause any issue though...

I am not certain either, but removing these is likely needed to avoid blocking the background thread. Maybe a good start is to draft a summary of all the tasks which the runtime currently performs in the "background" to make forward progress?

sophimao · 2022-10-27T13:45:17Z

Got a thread leak... Will look into this further.

I would imagine in a normal flow (not unit test) acl_platform will only be initialized once as user wouldn't have access to the acl_reset call, so the device op queue update thread will only be created once. But this is not the case for unit test, where acl_platform is initialized and reset multiple times. Need to find a good way to deal with all the threads created during this process.

sophimao · 2022-11-16T14:38:20Z

Performance testing using kernel_latency test:

a10_ref: Single kernel execution time: 30.8214 us -> 37.7533 us (This is not good 😞)
pac_a10: Some issue with kernel interface version id and I can't run the design 😦
Error: HAL Kern: Version mismatch! Expected 0xa0c00001 but read 0x0

sophimao · 2022-11-17T18:40:56Z

On synchronization between signal handler and other user threads, seems like the only safe primitives to use inside linux signal handler are:

sig_atomic_t
lock-free std::atomic

Note, in c++20, std::atomic include conditional wait calls wait, notify_one and notify_all

Potential ways to synchronize with signal handler are:

Use semaphores (comes with c++20 in the c++ concurrency library)
Self-pipe trick (linux only)
Use sigwait/sigwaitinfo (linux only)
Use signalfd (linux only)
lock-free std::atomic (must poll without the wait and notify calls)
sig_atomic_t (must poll)

All seems to have some problem so perhaps we still have to continue using our async-safe condition variable?

pcolberg

Thanks @sophimao for tackling this! Please don't worry about addressing any comments with code changes; for now, I would like to work out the design on a high level, particularly the affected data structure(s) as detailed below.

pcolberg · 2022-11-18T19:53:44Z

src/acl_globals.cpp

+    l_reset_present_board();
+
+    acl_platform.offline_device = "";
+    acl_platform.num_devices = 0;
+    for (unsigned i = 0; i < ACL_MAX_DEVICE; ++i) {
+      acl_platform.device[i] = _cl_device_id();
+    }
+    acl_platform.initialized = 0;


This (almost) duplicates acl_reset(). Can that function be called instead?

I remember seeing some issue if I call acl_signal_device_update without the lock or in another critical section, but I can't remember clearly if that's the case. I'll try to think through it again to see if using acl_reset here is okay.

pcolberg · 2022-11-18T19:54:28Z

include/acl_globals.h

+// Version of reset used in unit test only
+void acl_reset_join_thread(void);


If this is used in the unit tests only it should not be in the include or src directories, but under test.

pcolberg · 2022-11-18T19:58:02Z

include/acl_types.h

@@ -1636,6 +1636,9 @@ typedef struct _cl_platform_id
  // The device operation queue.
  // These are the operations that can run immediately on the device.
  acl_device_op_queue_t device_op_queue;
+  // Thread used to update device_op_queue when kernel interrupt triggers
+  acl_thread_t device_op_queue_update_thread;
+  bool outstanding_interrupt;


Note for my understanding: Reads/writes to this variable are guarded by the global mutex.

pcolberg · 2022-11-18T20:00:06Z

src/acl_platform.cpp

@@ -737,6 +742,25 @@ static void l_add_device(int idx) {
  device->address_bits = 64;              // Yes, our devices are 64-bit.
 }

+void *l_eagerly_update_device_op_queue(void *arg) {
+  while (true) {
+    std::scoped_lock lock{acl_mutex_wrapper};


I am confused how this is not blocking runtime progress. If the mutex is acquired first and the wait for a device update is second, won't the mutex be locked most of the time, such that other threads trying to acquire the mutex cannot make progress?

When the thread waits it releases the lock so that other threads can make progress, are you referring to the time between the lock and wait during which other threads will be blocked?

pcolberg · 2022-11-18T20:12:51Z

src/acl_platform.cpp

+    std::scoped_lock lock{acl_mutex_wrapper};
+
+    // Sleep if no interrupt happening
+    acl_wait_for_device_update(NULL);


These are good questions! However, instead of considering the options for the choice of syncronisation mechanism, I suggest taking a step back and considering the data structures that are underlying this change and how they might be refactored to avoid (too much) locking due to concurrent access.

The primary goal here is to make forward progress in processing the device operations queue, correct? With this implementation, the queue is written to by multiple threads and read from by multiple threads (some spawned by the user and the background thread spawned by the runtime). Is there any advantage in processing the queue from multiple threads? To avoid contention over the global mutex, could the processing, i.e. reading from the queue be taken over by the background thread only? Such that the user's (potentially) multiple threads only feed, i.e. write to the queue?

If the queue is indeed the central data structure to this change, then the task becomes to find a thread-safe data structure. In the simplest implementation this could be a achieved using a dedicated mutex. Beyond that, third-party implementations are available, some of them lock-free, with different tradeoffs to be evaluated.

pcolberg · 2022-11-18T20:16:02Z

src/acl_thread.cpp

+  if (acl_get_platform()->device_op_queue_update_thread) {
+    {
+      std::scoped_lock lock{acl_mutex_wrapper};
+      acl_get_platform()->initialized = 0;
+      acl_signal_condvar(&l_acl_global_condvar); // wake up waiting thread
+    }
+    acl_thread_join(&acl_get_platform()->device_op_queue_update_thread);
+  }


Why is the thread joined in the library destructor? From my naïve understanding, I would expect the thread to be created on clCreateContext() and joined on (the last) clReleaseContext()?

I originally included the thread as a member of acl_platform so I created it when initializing the platform. However the acl_reset function is never called in a user flow so I had to put it in the destructor... I agree that it's probably better to do this when creating and releasing context, but need to add some flag indicating the first creation and the last release.

sophimao added 2 commits October 25, 2022 08:28

Fix typo in acl_threadsupport library

3d9dcdd

Add background thread to update device op queue upon receiving kernel…

ea9aac3

… interrupt

sophimao commented Oct 27, 2022

View reviewed changes

sophimao added 2 commits October 28, 2022 13:49

Resolve unit test thread leak issue

5e7b0de

Run clang-format

c136e33

pcolberg assigned sophimao Nov 18, 2022

pcolberg added the enhancement New feature or request label Nov 18, 2022

pcolberg added this to the 2023.2 milestone Nov 18, 2022

pcolberg reviewed Nov 18, 2022

View reviewed changes

zibaiwan removed this from the 2023.2 milestone May 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add background thread to update device op queue upon receiving kernel interrupt #189

Add background thread to update device op queue upon receiving kernel interrupt #189

sophimao commented Oct 27, 2022

sophimao Oct 27, 2022

sophimao Oct 28, 2022

sophimao Oct 27, 2022 •

edited

Loading

sophimao Nov 17, 2022 •

edited

Loading

pcolberg Nov 18, 2022 •

edited

Loading

sophimao Nov 18, 2022

pcolberg Nov 18, 2022 •

edited

Loading

sophimao commented Oct 27, 2022 •

edited

Loading

sophimao commented Nov 16, 2022

sophimao commented Nov 17, 2022 •

edited

Loading

pcolberg left a comment

pcolberg Nov 18, 2022

sophimao Nov 18, 2022

pcolberg Nov 18, 2022

pcolberg Nov 18, 2022

pcolberg Nov 18, 2022

sophimao Nov 18, 2022

pcolberg Nov 18, 2022 •

edited

Loading

pcolberg Nov 18, 2022

sophimao Nov 18, 2022

		// Version of reset used in unit test only
		void acl_reset_join_thread(void);

Add background thread to update device op queue upon receiving kernel interrupt #189

Are you sure you want to change the base?

Add background thread to update device op queue upon receiving kernel interrupt #189

Conversation

sophimao commented Oct 27, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sophimao Oct 27, 2022 • edited Loading

Choose a reason for hiding this comment

sophimao Nov 17, 2022 • edited Loading

Choose a reason for hiding this comment

pcolberg Nov 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pcolberg Nov 18, 2022 • edited Loading

Choose a reason for hiding this comment

sophimao commented Oct 27, 2022 • edited Loading

sophimao commented Nov 16, 2022

sophimao commented Nov 17, 2022 • edited Loading

pcolberg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pcolberg Nov 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sophimao Oct 27, 2022 •

edited

Loading

sophimao Nov 17, 2022 •

edited

Loading

pcolberg Nov 18, 2022 •

edited

Loading

pcolberg Nov 18, 2022 •

edited

Loading

sophimao commented Oct 27, 2022 •

edited

Loading

sophimao commented Nov 17, 2022 •

edited

Loading

pcolberg Nov 18, 2022 •

edited

Loading