Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add background thread to update device op queue upon receiving kernel interrupt #189

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sophimao
Copy link
Contributor

Currently the runtime only makes update when:

  • A synchronization API call (e.g., clFinish) is called
  • An operation is being enqueued (e.g., clEnqueueNDRangeKernel)

This creates potential hang in the runtime. As there is limitation on FPGA that only one instance (launch)
of the same kernel can be running on the board at one time, multiple launches has to be sequential, where
later launches have to wait for the currently running launch to finish. For a kernel launch that has other
launches waiting on it, when non of the two above update calls happens after that kernel launch finishes,
the runtime will not submit waiting launches and thereby result in a hang.

In this change a background thread is added to update the device op queue when there is outstanding
kernel interrupt to be served.

acl_get_platform()->initialized = 0;
acl_signal_condvar(&l_acl_global_condvar); // wake up waiting thread
}
acl_thread_join(&acl_get_platform()->device_op_queue_update_thread);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this section of code is necessary here? The situation I have in mind is when there is an interrupt coming and waking up the device op queue update thread, and the program is about to unload the library, then the thread will acquire the lock and get preempted, when acl_reset_condvar is called it will see a locked lock, and that might result in some problem. But I guess having an interrupt coming in when the program is about to end is pretty rare?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the unit tests it seems like this section might still be required, even with this joining code some of the CI jobs are failing when exiting the program:

6: TEST(auto_configure, many_ok_forward_compatibility) - 22 ms
6: TEST(auto_configure, simple) - 0 ms
6: 
6: OK (224 tests, 224 ran, 138598587 checks, 0 ignored, 0 filtered out, 14787 ms)
6: 
6: acl_test: ../lib/acl_threadsupport/src/acl_threadsupport.c:372: acl_release_condvar: Assertion `ret == 0' failed.
6/6 Test #6: acl_test .........................Child aborted***Exception:  21.01 sec

std::scoped_lock lock{acl_mutex_wrapper};

// Sleep if no interrupt happening
acl_wait_for_device_update(NULL);
Copy link
Contributor Author

@sophimao sophimao Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The side effect of using this existing call is that the thread gets woken up not just when there is a mmd triggered kernel interrupt, but also when all the signal handler functions are called (acl_receive_kernel_update, acl_set_execution_status, acl_receive_device_exception, and acl_schedule_printf_buffer_pickup). But this is already better than always polling, I wonder if it is necessary to fine grain this even further (might be risky)?

Copy link
Contributor Author

@sophimao sophimao Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can potentially

  • Introduce a new acl_condvar_s, or
  • Use a semaphore (is this doable?)

To signal to the device op queue update thread, to avoid waking up from event status update calls, etc.

However, the device op queue update thread might still need to acquire the global lock when doing the update, so that no other thread is modifying the shared resources (which are quite a lot...).

Copy link
Contributor

@pcolberg pcolberg Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good questions! However, instead of considering the options for the choice of syncronisation mechanism, I suggest taking a step back and considering the data structures that are underlying this change and how they might be refactored to avoid (too much) locking due to concurrent access.

The primary goal here is to make forward progress in processing the device operations queue, correct? With this implementation, the queue is written to by multiple threads and read from by multiple threads (some spawned by the user and the background thread spawned by the runtime). Is there any advantage in processing the queue from multiple threads? To avoid contention over the global mutex, could the processing, i.e. reading from the queue be taken over by the background thread only? Such that the user's (potentially) multiple threads only feed, i.e. write to the queue?

If the queue is indeed the central data structure to this change, then the task becomes to find a thread-safe data structure. In the simplest implementation this could be a achieved using a dedicated mutex. Beyond that, third-party implementations are available, some of them lock-free, with different tradeoffs to be evaluated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! Yes, the primary goal is to make forward progress in processing the device operation queue. In the model you suggested, is it kind of like treating the device op queue as a buffer for consumer (runtime spawned thread) and producer (user threads)? This conceptually makes sense, just that I'm not sure if there are other runtime constructs (events, command queue, etc.) modified during the process of updating the device op queue, would that pose any obstacle to the consumer-producer model?

Also I'm thinking, if at the end of the day we are actively making progress on the device op queue, maybe it would be possible to remove some of the acl_idle_updates calls in clEnqueue... used to nudge the device op queue. I'm not sure if this idea is practical in any sense or if it will cause any issue though...

Copy link
Contributor

@pcolberg pcolberg Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the model you suggested, is it kind of like treating the device op queue as a buffer for consumer (runtime spawned thread) and producer (user threads)?

Yes, I am thinking of a multi-producer single-consumer queue. The runtime is thread-safe, but it is not efficient since it was designed for a single thread only. The addition of a background thread means a significant overhaul of this design.

This conceptually makes sense, just that I'm not sure if there are other runtime constructs (events, command queue, etc.) modified during the process of updating the device op queue, would that pose any obstacle to the consumer-producer model?

Indeed, that is the next question. As the background thread consumes new operations from the queue, can it perform its work without constantly obtaining the global mutex for other resources, thus blocking the foreground/user threads feeding the queue? What resources does the background thread need to perform its work? Which of these are shared with foreground threads and how frequently?

Also I'm thinking, if at the end of the day we are actively making progress on the device op queue, maybe it would be possible to remove some of the acl_idle_updates calls in clEnqueue... used to nudge the device op queue. I'm not sure if this idea is practical in any sense or if it will cause any issue though...

I am not certain either, but removing these is likely needed to avoid blocking the background thread. Maybe a good start is to draft a summary of all the tasks which the runtime currently performs in the "background" to make forward progress?

@sophimao
Copy link
Contributor Author

sophimao commented Oct 27, 2022

Got a thread leak... Will look into this further.

I would imagine in a normal flow (not unit test) acl_platform will only be initialized once as user wouldn't have access to the acl_reset call, so the device op queue update thread will only be created once. But this is not the case for unit test, where acl_platform is initialized and reset multiple times. Need to find a good way to deal with all the threads created during this process.

@sophimao
Copy link
Contributor Author

Performance testing using kernel_latency test:

  • a10_ref: Single kernel execution time: 30.8214 us -> 37.7533 us (This is not good 😞)
  • pac_a10: Some issue with kernel interface version id and I can't run the design 😦
    Error: HAL Kern: Version mismatch! Expected 0xa0c00001 but read 0x0

@sophimao
Copy link
Contributor Author

sophimao commented Nov 17, 2022

On synchronization between signal handler and other user threads, seems like the only safe primitives to use inside linux signal handler are:

Note, in c++20, std::atomic include conditional wait calls wait, notify_one and notify_all

Potential ways to synchronize with signal handler are:

  • Use semaphores (comes with c++20 in the c++ concurrency library)
  • Self-pipe trick (linux only)
  • Use sigwait/sigwaitinfo (linux only)
  • Use signalfd (linux only)
  • lock-free std::atomic (must poll without the wait and notify calls)
  • sig_atomic_t (must poll)

All seems to have some problem so perhaps we still have to continue using our async-safe condition variable?

@pcolberg pcolberg added the enhancement New feature or request label Nov 18, 2022
@pcolberg pcolberg added this to the 2023.2 milestone Nov 18, 2022
Copy link
Contributor

@pcolberg pcolberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sophimao for tackling this! Please don't worry about addressing any comments with code changes; for now, I would like to work out the design on a high level, particularly the affected data structure(s) as detailed below.

Comment on lines +239 to +246
l_reset_present_board();

acl_platform.offline_device = "";
acl_platform.num_devices = 0;
for (unsigned i = 0; i < ACL_MAX_DEVICE; ++i) {
acl_platform.device[i] = _cl_device_id();
}
acl_platform.initialized = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (almost) duplicates acl_reset(). Can that function be called instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember seeing some issue if I call acl_signal_device_update without the lock or in another critical section, but I can't remember clearly if that's the case. I'll try to think through it again to see if using acl_reset here is okay.

Comment on lines +25 to +26
// Version of reset used in unit test only
void acl_reset_join_thread(void);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is used in the unit tests only it should not be in the include or src directories, but under test.

@@ -1636,6 +1636,9 @@ typedef struct _cl_platform_id
// The device operation queue.
// These are the operations that can run immediately on the device.
acl_device_op_queue_t device_op_queue;
// Thread used to update device_op_queue when kernel interrupt triggers
acl_thread_t device_op_queue_update_thread;
bool outstanding_interrupt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for my understanding: Reads/writes to this variable are guarded by the global mutex.

@@ -737,6 +742,25 @@ static void l_add_device(int idx) {
device->address_bits = 64; // Yes, our devices are 64-bit.
}

void *l_eagerly_update_device_op_queue(void *arg) {
while (true) {
std::scoped_lock lock{acl_mutex_wrapper};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused how this is not blocking runtime progress. If the mutex is acquired first and the wait for a device update is second, won't the mutex be locked most of the time, such that other threads trying to acquire the mutex cannot make progress?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the thread waits it releases the lock so that other threads can make progress, are you referring to the time between the lock and wait during which other threads will be blocked?

std::scoped_lock lock{acl_mutex_wrapper};

// Sleep if no interrupt happening
acl_wait_for_device_update(NULL);
Copy link
Contributor

@pcolberg pcolberg Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good questions! However, instead of considering the options for the choice of syncronisation mechanism, I suggest taking a step back and considering the data structures that are underlying this change and how they might be refactored to avoid (too much) locking due to concurrent access.

The primary goal here is to make forward progress in processing the device operations queue, correct? With this implementation, the queue is written to by multiple threads and read from by multiple threads (some spawned by the user and the background thread spawned by the runtime). Is there any advantage in processing the queue from multiple threads? To avoid contention over the global mutex, could the processing, i.e. reading from the queue be taken over by the background thread only? Such that the user's (potentially) multiple threads only feed, i.e. write to the queue?

If the queue is indeed the central data structure to this change, then the task becomes to find a thread-safe data structure. In the simplest implementation this could be a achieved using a dedicated mutex. Beyond that, third-party implementations are available, some of them lock-free, with different tradeoffs to be evaluated.

Comment on lines +106 to +113
if (acl_get_platform()->device_op_queue_update_thread) {
{
std::scoped_lock lock{acl_mutex_wrapper};
acl_get_platform()->initialized = 0;
acl_signal_condvar(&l_acl_global_condvar); // wake up waiting thread
}
acl_thread_join(&acl_get_platform()->device_op_queue_update_thread);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the thread joined in the library destructor? From my naïve understanding, I would expect the thread to be created on clCreateContext() and joined on (the last) clReleaseContext()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally included the thread as a member of acl_platform so I created it when initializing the platform. However the acl_reset function is never called in a user flow so I had to put it in the destructor... I agree that it's probably better to do this when creating and releasing context, but need to add some flag indicating the first creation and the last release.

@zibaiwan zibaiwan removed this from the 2023.2 milestone May 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants