Replies: 4 comments
-
I generally agree that this is a weird limitation that does not really translate into a good user experience. Ideally, I'd also like to move forward in the direction you suggest. However, after giving it some thought, I think there is an issue that needs careful consideration.

Right now, if all queues go out of scope and you then create a new one, you'll get an error that MPI cannot be initialized again, after having been finalized already. That's not ideal, and we should try to emit a clearer error message. That being said, consider code like the following:

```cpp
void do_something() { celerity::distr_queue queue; queue.submit(/**/); }
void do_something_else() { celerity::distr_queue queue; queue.submit(/**/); }

int main() {
    do_something();
    do_something_else();
}
```

Right now this is not possible, and our answer as to why is "well, you are only allowed to copy the queue". After this change, the code above MIGHT be allowed, but only if a third …

As a possible solution, we could move away from the whole "queue lifetime equals runtime lifetime" concept. This would however mean that users have to explicitly synchronize at the end of every program -- not very nice. Also, while we do have …

We could also make it actually re-initialize a new runtime instance. While that is of course possible (and that's actually what we're already doing for unit tests), I don't think this is what a user typically would want (especially since there is some overhead involved), and I'd like to see a legitimate use case for it first.

The only real solution that I can come up with right now is to have a parameter for …

cc @PeterTh

Implementation-wise, since the …
-
In terms of implementation, do you think it would be possible to go about it by adding a layer of indirection? I.e., we'd have one user-facing level with …
-
I feel like there are at least three perspectives on this, and the requirements / design constraints are somewhat different for each of them.

From the perspective of keeping simple things simple, and it being easy to teach a basic "Hello World" example of Celerity usage, the behavior of having a single queue with user-managed lifetime, RAII, and automatic synchronization is useful. It lets you write an example with simple host/client interaction without too much mental overhead.

From the perspective of a user with a large-scale program who wants to introduce Celerity piece-by-piece, having free access to "the" queue at arbitrary points seems like a useful feature simplifying Celerity usage. However, I wonder how valuable it is in practice: as soon as you want to transfer any GPU-side data between individual phases of your program, or re-use buffers, you'll probably need some abstraction for Celerity-relevant data that you pass around anyway. Since that is probably crucial in most programs for decent performance, at that point you could just as well include a reference to your queue in that data as well. (At least that has been my experience with porting one larger-scale program so far.)

From the perspective of API "purity", and keeping the potential for lifetime errors to a minimum, even allowing the queue to be copied is already problematic. (That is not to say that people won't get lifetime issues even without copy-able queues; we need to make the compiler plugin standard and extend its coverage to really help out there.)

I think the idea of having separate user-facing queues which are independent of the (singleton) "real" distributed queue might be a viable way to try and bridge these perspectives. However, that's a pretty significant change, and we should make sure that it's really solving issues without introducing new ones (or greatly increasing the complexity of either the implementation or mental model without sufficient payoff).
-
I'm generally not loving the implications of the queue being copyable either (to be honest I don't think I thought it fully through back when I added that functionality). I would suggest that we don't move forward with this until we have a clearer understanding of the trade-offs and what it is that we actually want to achieve with this API.

We know at least from the large-ish program that @PeterTh talked about, that passing around a Celerity context (that references the queue) can be a viable solution. As we are preparing to port another, larger program right now, I'll be interested to see how it pans out there.

One thing I think we absolutely have to ensure is that the out-of-the-box experience for new users is as seamless as possible. So we should definitely avoid surprising/hidden semantics regarding synchronization.

@n-io Could you maybe outline the use case that led you to opening this issue, so we can get a better understanding?

As a side note, during internal discussions of this issue we also encountered the question of multi-threading: I want to emphasize here that right now, we assume all work is being submitted to the queue from a single thread. At first glance, multiple "detached" queues might seem like a good way to introduce multi-threading capabilities to the Celerity API. However, we still have our strict SPMD/pre-pass/control-flow requirements, which I think might pan out to be really awkward when dealing with multiple threads.
-
Celerity only allows one instance of the `celerity::distr_queue` class, as captured by the error:

> Only one celerity::distr_queue can be created per process (but it can be copied!)
As the error message suggests, workarounds do exist, but they require additional work from the user to manage and copy queue instances. This extra work comes at effectively no gain to the user. Manually managing queues would make sense if multiple queues were possible and the user needed to know which one they are dealing with; with only one queue instance permitted, however, it would simplify both the API and user code if that instance were managed by Celerity.
Please consider pushing the work for managing the queue instance into Celerity, exposing a singleton to the user.