
[proposal] Structured Async for Mojo #3945

Open
wants to merge 1 commit into nightly

Conversation

owenhilyard
Contributor

Proposes to add structured async to Mojo, following in the Rust tradition of async, since Mojo has the ability to fix many of the issues with Rust's async, some of which are ecosystem-inflicted and some of which stem from old decisions made about the language (such as that any value may be leaked). I think that this interface offers a lot more flexibility than the current one does for high-performance code, while providing a better path for gradual evolution. It does have some language-level dependencies, namely unions, and requires the introduction of `Send` and `Sync` traits, which are used to control data movement between threads.

@szbergeron
Contributor

Carrying some external discussion in for context on Waker--

The core motivation is that when building something of this sort, you want to spend as little time as possible processing useless data relative to useful data. This is simple in theory, but in practice the actual "knowledge" of what can versus can't effectively make progress is sparse and disaggregated, and there is rarely a clear way to even collect that state beyond just "poking" everything that "wants" to make progress.

Wakers partially solve this on their own. For many "boundaries" (things like channels, queues, timers), wakers can more or less be tossed over the barrier to be collected by the other side. When the "other side" is itself administered by a coroutine registered with an active executor (especially the same executor), the system nicely passes control away from something that can't progress to something else that can, and doesn't waste more time on the former until it can actually progress.
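
To make the "tossed over the barrier" idea concrete, here is a minimal sketch in Rust (the async model this proposal adapts); `Shared`, `Receiver`, and `send` are illustrative names, not part of any proposed Mojo API:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Shared state sitting "on the boundary": the receiver parks its waker
// here, the sender picks it up and wakes it when a value arrives.
struct Shared<T> {
    value: Option<T>,
    waker: Option<Waker>,
}

struct Receiver<T> {
    shared: Arc<Mutex<Shared<T>>>,
}

impl<T> Future for Receiver<T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let mut shared = self.shared.lock().unwrap();
        if let Some(value) = shared.value.take() {
            Poll::Ready(value)
        } else {
            // Toss the waker over the barrier; the executor will not poll
            // this future again until the sender calls wake().
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

fn send<T>(shared: &Arc<Mutex<Shared<T>>>, value: T) {
    let waker = {
        let mut s = shared.lock().unwrap();
        s.value = Some(value);
        s.waker.take()
    };
    if let Some(w) = waker {
        w.wake(); // hand control back to something that can now progress
    }
}
```

The receiver parks its waker and returns `Pending`; the sender takes that waker and wakes it exactly when there is something to do, so the executor never re-polls the receiver fruitlessly.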

This works very nicely for "closed" systems that can be reduced almost entirely to a single computation, and where there aren't other computational priorities that exist outside of the executor. Unfortunately, the outside world also exists. The most performant forms of I/O today are not interrupt-driven and have no way to directly signal in and poke a waker. Moreover, many systems have multiple priorities competing for attention, and for various architectural reasons those may not live within the same executor even though they reside within the same process.

This motivates another cut along which to aggregate/concentrate useful "threads to pull": subsystems. You may have many places where you need to wait for an operation to complete within io_uring, or for some condition to occur in some region of shared memory. This requires busy polling, and there is no straightforward (naive) way for a waker to drive such a computation. If this reduces to every future that is waiting on an operation busy-polling on its own, we end up with an excess of duplicated computation that makes no progress (bad!).
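
For illustration, the naive per-future busy poll looks roughly like this in Rust terms (`Completions` and `NaiveWait` are hypothetical); the point is only that every waiter scans the completion state itself and immediately re-schedules:

```rust
use std::collections::HashSet;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll};

// Hypothetical completion source (e.g. a shared-memory flag or an
// io_uring completion queue drained into a set of finished op ids).
struct Completions {
    done: HashSet<u64>,
}

// Naive: every waiting future scans the completion set itself and then
// immediately asks to be polled again. With N waiters, the state is
// checked N times per scheduler pass even when nothing has completed.
struct NaiveWait {
    op_id: u64,
    completions: Arc<Mutex<Completions>>,
}

impl Future for NaiveWait {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.completions.lock().unwrap().done.remove(&self.op_id) {
            Poll::Ready(())
        } else {
            cx.waker().wake_by_ref(); // busy poll: re-schedule immediately
            Poll::Pending
        }
    }
}
```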

What's the alternative? Busy polling must occur, but can be moderated and deduplicated. It may or may not even need to happen within the same executor, but it must happen somewhere. If we create a way for subsystems to be statically registered, or even dynamically registered with some prioritization flag for executors to poll them with priority more closely matching their utility (how many coroutines depend on them progressing), we can then treat them almost the same way as any other async "barrier" (such as channels, async mutexes, timers). This way, a coroutine corresponding to each subsystem can itself collect wakers from other coroutines that would otherwise busy poll on their own, and itself act as a form of scheduler, as alluded to in the proposal.
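
Here is a rough sketch of that deduplicated shape, again in Rust terms with hypothetical names (`CompletionQueue`, `Subsystem`, `Driver`): dependent futures park their wakers with the subsystem, and a single driver future is the only thing that touches the completion queue:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

// Hypothetical completion source (e.g. an io_uring completion queue).
trait CompletionQueue {
    /// Drain any completed operation ids without blocking.
    fn drain(&mut self) -> Vec<u64>;
}

// The subsystem: per-operation wakers parked by dependent futures.
struct Subsystem<Q: CompletionQueue> {
    queue: Q,
    waiters: HashMap<u64, Waker>,
}

// One driver future per subsystem: the only thing that busy-polls.
struct Driver<Q: CompletionQueue> {
    subsystem: Arc<Mutex<Subsystem<Q>>>,
}

impl<Q: CompletionQueue> Future for Driver<Q> {
    type Output = (); // runs for the life of the subsystem

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut sub = self.subsystem.lock().unwrap();
        for id in sub.queue.drain() {
            if let Some(waker) = sub.waiters.remove(&id) {
                waker.wake(); // only completed operations get re-polled
            }
        }
        // Ask to be polled again; a real driver would instead be
        // registered with the executor as a prioritized busy-poll task.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}
```

With this shape, the driver is the single place where busy polling must occur, and how often it runs (and with what priority) becomes an executor policy rather than a cost every waiter pays individually.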

The implementation specifics have a bunch of intertwined tradeoffs, but as far as we can tell these broad strokes are the limit of how minimal the overall type structure can be made without necessarily sacrificing significant performance to duplicated computation (no-progress polling).

@owenhilyard
Contributor Author

Continuing on from what @szbergeron mentioned, there are a lot of IO mechanisms which are completion based. These busy-polled "subsystem" futures, which ideally can be spawned in a way that makes the executor aware of them as special, can help de-duplicate a lot of that polling, since most of these mechanisms deliver their results through some kind of queue. For epoll-like things, you still want a more central place to handle polling the eventfd and waking things up. The current API doesn't really have a good way to support this kind of flexibility, so it all but guarantees we have the same executor lock-in issues that Go has, and that Rust has with Tokio, and it doesn't leave room for libraries to experiment with designs that the stdlib executor might benefit from, or for high-performance applications to build an executor that meets their own needs.
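
As a sketch of what "made aware of them as special" could mean (purely illustrative, not a proposed API): an executor that keeps registered subsystem drivers separate from ordinary woken tasks and busy-polls only the drivers on each scheduler pass:

```rust
use std::collections::VecDeque;

// Hypothetical task handle: anything the executor can poll once.
// (Illustrative only; a real executor would hold pinned futures and wakers.)
trait PollOnce {
    fn poll_once(&mut self);
}

type Task = Box<dyn PollOnce>;

struct Executor {
    ready: VecDeque<Task>,  // tasks that have been woken
    subsystems: Vec<Task>,  // busy-polled subsystem drivers
}

impl Executor {
    fn register_subsystem(&mut self, driver: Task) {
        self.subsystems.push(driver);
    }

    fn run_once(&mut self) {
        // Poll each subsystem driver once per scheduler pass; drivers
        // wake only the tasks whose completions actually arrived.
        for driver in &mut self.subsystems {
            driver.poll_once();
        }
        // Then run whatever those wake-ups made ready.
        while let Some(mut task) = self.ready.pop_front() {
            task.poll_once();
        }
    }
}
```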
