Streams #31
Some handler might need multiple streams, so I guess it needs to be a list of arrays.
|
Yeah, that looks nice! |
How about we (ab)use indexing notation for that:

```python
_h[1].dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h[2].dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h[3].dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h[4].dot_add_mm(dZa[t], x[t], dWz, transa=True)
```

If `h1 = _h[1]`:

```python
h1.dot_add_mm(dFa[t], x[t], dW, transa=True)
h1.dot_add_mm(dOa[t], x[t], dW, transa=True)
h1.dot_add_mm(dZa[t], x[t], dW, transa=True)
```
|
Another thing to keep in mind: it'd be nice if streams could be specified for layers too, so that we could run layers in parallel. Of course, just like one needs to know how many streams an operation uses while writing a layer implementation, one would also need to know how many streams a layer uses while building a network. This isn't too much to ask: the docs should take care of it ;) |
I don't like the abused indexing notation; it's a bit too unintuitive for someone who doesn't know the codebase too well. I'd rather do something like
where |
Ok, that's a fair point. What I don't like about

```python
_h.set_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h.dot_add_mm(dZa[t], x[t], dWz, transa=True)
```

is that after the `set_stream(1)` call, nothing in the following lines tells you which stream they run on. We could make some kind of `with_stream` construct instead:

```python
_h.with_stream(1).dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.with_stream(2).dot_add_mm(dFa[t], x[t], dWf, transa=True)
_h.with_stream(3).dot_add_mm(dOa[t], x[t], dWo, transa=True)
_h.with_stream(4).dot_add_mm(dZa[t], x[t], dWz, transa=True)
```

But that of course implies some (small) overhead.
|
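For illustration only (a hypothetical sketch, not the actual handler API): the straightforward way to implement `with_stream` builds a small wrapper object on every call, which is one plausible source of that (small) overhead.

```python
class _StreamBound(object):
    """Thin view of a handler that runs every forwarded call on one fixed stream."""
    def __init__(self, handler, stream_id):
        self._handler = handler
        self._stream_id = stream_id

    def __getattr__(self, name):
        op = getattr(self._handler, name)
        def call_on_stream(*args, **kwargs):
            # assumption: the underlying ops accept an optional `stream` argument
            return op(*args, stream=self._handler.streams[self._stream_id], **kwargs)
        return call_on_stream


class Handler(object):
    def __init__(self, streams):
        self.streams = streams          # e.g. a list of pycuda Stream objects

    def with_stream(self, i):
        return _StreamBound(self, i)    # a fresh wrapper per call: the overhead

    def dot_add_mm(self, a, b, out, transa=False, stream=None):
        """Placeholder; a real handler would launch the GEMM on `stream`."""
```

The overhead can be reduced to essentially one lookup by creating the wrappers once and caching them (see the sketch further down in this thread).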
Alright, to summarize:

1 - We can add stream as an argument to all operations, but then we have to do it for other handlers which may not use streams, so it's a bit weird.

2 - We can use set_streams() without returning anything. Then we'd do

```python
_h.set_streams([1])
_h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.set_streams([2])
_h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on stream 2
```

This option means that the handler carries the stream selection as state: everything after `set_streams([2])` keeps running on stream 2 until the streams are changed again.

3 - We can use

```python
_h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream
```

We should pick one and start working on it.
|
Option 4:

```python
with _h.streams(1):
    _h.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
with _h.streams(2):
    _h.dot_mm(flat_dH, flat_input, out=dW, transa=True)
    _h.sum_t(flat_dH, axis=0, out=dbias)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream
```

Considering issue 2a, we could do the following: say the handler internally uses 15 streams (0-14), but we group them into five groups of 3 streams.
|
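A hypothetical sketch of what that grouping could look like on the handler side (names invented; `pycuda.autoinit` is only there to give the example a CUDA context):

```python
import pycuda.autoinit            # creates a CUDA context for this sketch
import pycuda.driver as cuda

NUM_GROUPS = 5
STREAMS_PER_GROUP = 3             # 5 * 3 = 15 streams, numbered 0-14

class GroupedStreams(object):
    def __init__(self):
        self._streams = [cuda.Stream()
                         for _ in range(NUM_GROUPS * STREAMS_PER_GROUP)]

    def group(self, g):
        """Return the streams belonging to group g (0-4)."""
        start = g * STREAMS_PER_GROUP
        return self._streams[start:start + STREAMS_PER_GROUP]

    def synchronize_group(self, g):
        """Block until all work queued on group g has finished."""
        for s in self.group(g):
            s.synchronize()
```

An operation (or a layer) would then ask for a whole group and distribute its work over the streams inside it, so callers only reason about five groups instead of fifteen individual streams.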
@TimDettmers, this issue may be of interest. |
I will look into this and double buffering after I have taken a closer look at the codebase and the PyCUDA API. Double buffering is a bit more complicated, because even with streams there are synchronous parts when you do host -> GPU copies. |
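Not brainstorm code, just a minimal PyCUDA sketch of the point about host -> GPU copies: they only overlap with other work if the host buffer is page-locked, so double buffering needs pinned buffers plus a dedicated copy stream. The shapes and the `produce_chunk` helper are made up for the example.

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda
from pycuda import gpuarray

chunk_shape = (128, 784)
copy_stream = cuda.Stream()       # used only for host -> GPU transfers
compute_stream = cuda.Stream()    # used for the actual kernels

# Asynchronous copies require page-locked (pinned) host memory.
host_bufs = [cuda.pagelocked_empty(chunk_shape, np.float32) for _ in range(2)]
dev_bufs = [gpuarray.empty(chunk_shape, np.float32) for _ in range(2)]

def produce_chunk(i):
    """Stand-in for fetching the next minibatch from the data iterator."""
    return np.random.rand(*chunk_shape).astype(np.float32)

n_chunks = 10
host_bufs[0][...] = produce_chunk(0)
cuda.memcpy_htod_async(dev_bufs[0].gpudata, host_bufs[0], copy_stream)

for i in range(n_chunks):
    cur, nxt = i % 2, (i + 1) % 2
    copy_stream.synchronize()                       # chunk i is now on the GPU
    if i + 1 < n_chunks:                            # prefetch chunk i + 1 ...
        host_bufs[nxt][...] = produce_chunk(i + 1)
        cuda.memcpy_htod_async(dev_bufs[nxt].gpudata, host_bufs[nxt],
                               copy_stream)
    # ... while the kernels consuming dev_bufs[cur] run on compute_stream here.
    compute_stream.synchronize()
```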
Great! Let us know if you need any clarifications. There is some restructuring of layers going on in a branch right now, but this does not affect the overall architecture and philosophy. |
Coming back to this: I like option 3 the best. The problem with option 4 is that it gets too wordy too quickly. Especially considering that you'll often want to interleave ops on different streams. The initial example would become:
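Presumably something along these lines (a reconstruction in option 4's form, not quoted from the original comment):

```python
with _h.streams(1):
    _h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
with _h.streams(2):
    _h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
with _h.streams(3):
    _h.dot_add_mm(dOa[t], x[t], dWo, transa=True)
with _h.streams(4):
    _h.dot_add_mm(dZa[t], x[t], dWz, transa=True)
```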
which doubles the line-count AND adds a lot of indentation. |
I agree. |
I think this should be post-release. It is important, so it shouldn't be rushed. Let's set up a benchmarking suite first, and do a little bit of profiling.

WRT Option 3 vs Option 4: actually those are not exclusive. If

```python
_h.with_streams([1]).dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
_h.with_streams([2]).dot_mm(flat_dH, flat_input, out=dW, transa=True)
_h.sum_t(flat_dH, axis=0, out=dbias)  # runs on default stream
```

works by having `with_streams` return a handler-like object, then that object can also be a context manager, and we could equally write:

```python
with _h.with_streams([1]) as h1:
    h1.dot_add_mm(flat_dH, W, out=flat_in_delta_buffer)
    h1.dot_mm(flat_dH, flat_input, out=dW, transa=True)
    h1.sum_t(flat_dH, axis=0, out=dbias)
```
|
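A hypothetical sketch of that combination (names invented, not the actual brainstorm API): `with_streams` returns a cached proxy that forwards calls on its streams and also implements the context-manager protocol, so the chained form and the `with` form go through exactly the same object.

```python
class _StreamsProxy(object):
    """Handler view bound to a fixed set of streams; usable as a context manager."""
    def __init__(self, handler, stream_ids):
        self._handler = handler
        self._stream_ids = tuple(stream_ids)

    def __getattr__(self, name):
        op = getattr(self._handler, name)
        def call_on_streams(*args, **kwargs):
            # assumption: the underlying ops accept an optional `streams` argument
            return op(*args, streams=self._stream_ids, **kwargs)
        return call_on_streams

    def __enter__(self):                   # enables `with _h.with_streams([1]) as h1:`
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False                       # don't swallow exceptions


class Handler(object):
    def __init__(self):
        self._proxies = {}                 # cached: with_streams is a dict lookup

    def with_streams(self, stream_ids):
        key = tuple(stream_ids)
        if key not in self._proxies:
            self._proxies[key] = _StreamsProxy(self, key)
        return self._proxies[key]

    def dot_add_mm(self, a, b, out, transa=False, streams=None):
        """Placeholder; a real handler would enqueue the GEMM on `streams`."""
```

Because the proxies are cached, the chained form costs one tuple construction and a dict lookup per call, which keeps the overhead concern raised earlier small.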
Sooner or later, we should think about introducing CUDA streams for our GPU implementation.

Side-effect: looking at the profiling outputs, across various examples the most expensive call we make is usually the `set_from_numpy` call in the `PyCudaHandler`. We should be able to eliminate the cost of that call completely once we use streams, as the memory transfers can all be done asynchronously (and we could finally implement a sensible double-buffering on GPUs).

I can think of two ways to add streams:

1 - Specify the stream for each call: add a `stream=None` optional argument to all the handler functions, so that the caller can specify the stream on which to execute. When the stream is not specified, we run on the default stream. We could pass either real CUDA streams or just stream IDs (integers). Calls would then maybe look like the sketch below.

2 - Add a separate function for specifying streams.
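For illustration only (hypothetical call shapes, not taken from the issue), the two variants might look roughly like this:

```python
# Option 1: every handler op takes an optional stream (an ID or a real CUDA stream).
_h.dot_add_mm(dIa[t], x[t], dWi, transa=True, stream=1)
_h.dot_add_mm(dFa[t], x[t], dWf, transa=True, stream=2)
_h.sum_t(dIa[t], axis=0, out=dbias)               # stream=None -> default stream

# Option 2: a separate call selects the stream for the operations that follow.
_h.set_stream(1)
_h.dot_add_mm(dIa[t], x[t], dWi, transa=True)
_h.set_stream(2)
_h.dot_add_mm(dFa[t], x[t], dWf, transa=True)
```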
In this short example, option 1 clearly looks better (IMO), but I can see option 2 working out nicely, too.
Another thing to consider is that we might set up some rules about streams. For example, something like "outputs should always be computed on streams 0-4"... or maybe it even makes sense to have different streams for outputs, internals and parameters, so we know which ones we need to synchronize on before starting computations in a new layer (or not, IDK).
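On the synchronization point, a hypothetical PyCUDA sketch (stream and event names are made up): the producing layer records an event on each of its output streams, and the next layer's streams wait on those events, so no host-side synchronization is needed.

```python
import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda

output_streams = [cuda.Stream() for _ in range(3)]       # producing layer's outputs
next_layer_streams = [cuda.Stream() for _ in range(3)]   # consuming layer's streams

# ... the producing layer enqueues its output computations on output_streams ...

# Record one event per output stream once its work has been enqueued.
done_events = []
for s in output_streams:
    evt = cuda.Event()
    evt.record(s)
    done_events.append(evt)

# The consuming layer's streams wait on those events; the wait is queued on the
# GPU, the host does not block.
for s in next_layer_streams:
    for evt in done_events:
        s.wait_for_event(evt)

# ... now the consuming layer can enqueue its kernels on next_layer_streams ...
```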