Why do we have packLoadBufferToStoreBuffer in GridwiseGemmToBlockwise ? #1132

manupak · 2023-06-27T13:54:05Z

manupak
Jun 27, 2023
Collaborator

I was reading through the code in GridwiseGemmToBlockwise around the GridwiseGemm lowering
and stumbled upon packLoadBufferToStoreBuffer, where we actually lay the data out differently in registers.

Why do we do that, as opposed to taking a view of the load buffer (i.e. a rock.transform of loadBuffer)? As far as I can see, it is just an element reorder.

cc : @krzysz00 @giuseros @jerryyin

krzysz00 · 2023-06-27T14:05:50Z

krzysz00
Jun 27, 2023
Maintainer

The reason is that we want to vectorize our global loads, which might produce an in-register layout that's different from what's best for vectorizing the stores to LDS. It's cheaper and more efficient to transpose in registers than to fiddle with the loads.
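To make the idea concrete, here is a minimal sketch of an in-register repack. It is not rocMLIR's actual packLoadBufferToStoreBuffer; the tile sizes, the `RegBuffer` type, and the function name `packLoadToStore` are all hypothetical. It assumes the global load was vectorized along a `d` dimension while the LDS store wants `k` contiguous.

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-thread tile sizes, chosen only for illustration.
constexpr std::size_t kPerThread = 2;
constexpr std::size_t dPerThread = 4;

using RegBuffer = std::array<float, kPerThread * dPerThread>;

// loadBuf is laid out as [k][d]: d is contiguous, matching a global load
// that was vectorized along d. The returned buffer is laid out as [d][k]:
// k is contiguous, which is the layout assumed here to suit vector stores
// to LDS. The element values are unchanged; only their order differs.
RegBuffer packLoadToStore(const RegBuffer &loadBuf) {
  RegBuffer storeBuf{};
  for (std::size_t k = 0; k < kPerThread; ++k) {
    for (std::size_t d = 0; d < dPerThread; ++d) {
      storeBuf[d * kPerThread + k] = loadBuf[k * dPerThread + d];
    }
  }
  return storeBuf;
}
```

With a load buffer holding 0..7, the store buffer comes out as 0, 4, 1, 5, 2, 6, 3, 7, so each pair that shares a `d` index is now adjacent and can go out as one vector.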

10 replies

krzysz00 Jun 27, 2023
Maintainer

I don't think that's right. The point of the packing code is to handle the differences between how to efficiently load from global memory and how to efficiently store to LDS

manupak Jun 27, 2023
Collaborator Author

But I can see why you need two sets of register buffers for the double buffering though...

> The point of the packing code is to handle the differences between how to efficiently load from global memory and how to efficiently store to LDS

Out of curiosity, was the LDS-friendly re-layout found to be profitable? (i.e. re-layout cost < LDS efficiency loss)
I mean, we can still do double buffering by looking at two identical buffers (viewed differently for the LDS store) -- ping-pong style.

giuseros Jun 27, 2023
Collaborator

I think there are two separate questions:
a) Can we have a single buffer instead of two separate loadBuffer and storeBuffer? I think we need two buffers because we are double buffering
b) Can we do packing on the fly? I.e., load a vector from global into tmp0, tmp1 = vector::extract(tmp0) and then lds_write(tmp1)? We were doing this before, but I thought it more optimal to load all the (untransposed) vectors into a buffer and then do a reg-to-reg operation to do the transpose
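The two options in (b) can be compared with a toy model. This is only a sketch under stated assumptions, not the rocMLIR lowering: `numLoads`, `vecLen`, `LdsResult`, and both function names are hypothetical, LDS is simulated with a `std::vector`, and the on-the-fly variant is assumed to issue one scalar store per extracted element. The store counts describe the model, not measured performance.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sizes: each thread issues numLoads global loads of vecLen
// contiguous elements each.
constexpr std::size_t numLoads = 2;
constexpr std::size_t vecLen = 4;

struct LdsResult {
  std::vector<float> lds; // contents of the simulated LDS region
  std::size_t storeOps;   // number of store operations issued
};

// Variant 1: pack on the fly. Each element of each loaded vector is
// extracted and written to its transposed position as a scalar store.
// globalMem must hold at least numLoads * vecLen elements.
LdsResult storeOnTheFly(const std::vector<float> &globalMem) {
  LdsResult r{std::vector<float>(numLoads * vecLen), 0};
  for (std::size_t i = 0; i < numLoads; ++i) {
    for (std::size_t j = 0; j < vecLen; ++j) {
      float tmp = globalMem[i * vecLen + j]; // element j of loaded vector i
      r.lds[j * numLoads + i] = tmp;         // one scalar store
      ++r.storeOps;
    }
  }
  return r;
}

// Variant 2: load everything into a load buffer, transpose it
// register-to-register into a store buffer, then store whole vectors.
LdsResult storeViaPackedBuffer(const std::vector<float> &globalMem) {
  std::vector<float> loadBuf(numLoads * vecLen);
  for (std::size_t n = 0; n < loadBuf.size(); ++n) {
    loadBuf[n] = globalMem[n];
  }

  std::vector<float> storeBuf(numLoads * vecLen);
  for (std::size_t i = 0; i < numLoads; ++i) {
    for (std::size_t j = 0; j < vecLen; ++j) {
      storeBuf[j * numLoads + i] = loadBuf[i * vecLen + j];
    }
  }

  LdsResult r{std::vector<float>(numLoads * vecLen), 0};
  for (std::size_t j = 0; j < vecLen; ++j) {
    // Modeled as a single vector store of numLoads contiguous elements.
    for (std::size_t i = 0; i < numLoads; ++i) {
      r.lds[j * numLoads + i] = storeBuf[j * numLoads + i];
    }
    ++r.storeOps;
  }
  return r;
}
```

Both variants leave the same data in the simulated LDS. In this model the difference is that the packed-buffer variant issues `vecLen` vector stores, while the on-the-fly variant issues `numLoads * vecLen` scalar stores.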

giuseros Jun 27, 2023
Collaborator

> Out of curiosity, was the LDS-friendly re-layout found to be profitable? (i.e. re-layout cost < LDS efficiency loss)
> I mean, we can still do double buffering by looking at two identical buffers (viewed differently for the LDS store) -- ping-pong style.

But we want to write vectors to LDS, so your registers need to be in the LDS-friendly layout, and then you call ds_buffer_write_128 to store them in LDS

manupak Jun 27, 2023
Collaborator Author

> a) Can we have a single buffer instead of two separate loadBuffer and storeBuffer? I think we need two buffers because we are double buffering

agreed.

> b) Can we do packing on the fly? I.e., load a vector from global into tmp0, tmp1 = vector::extract(tmp0) and then lds_write(tmp1)? We were doing this before, but I thought it more optimal to load all the (untransposed) vectors into a buffer and then do a reg-to-reg operation to do the transpose

Yes -- I'm more curious about the benefit of doing the transpose register-to-register. Did you see a benefit after that change (as opposed to before)? Otherwise, a conventional software pipelining optimization pass could be applied nicely here, hence the question.
