Replies: 1 comment 10 replies
-
The reason is that we want to vectorize our global loads, which might produce an in-register layout that's different from what's best for vectorizing the stores to LDS. It's cheaper to transpose in registers than to fiddle with the loads.
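To make that concrete, here is a minimal Python model of the repack. It is only a sketch under assumptions, not the rocMLIR implementation: the function name `repack_for_store`, the dimension names `k` and `d`, and the choice of `d` as the dimension the global load is vectorized along are all hypothetical.

```python
def repack_for_store(load_buf, k_per_thread, d_per_thread):
    """Reorder a flat per-thread register buffer from d-fastest to k-fastest.

    Hypothetical model: load_buf holds a k_per_thread x d_per_thread tile in
    the order a d-vectorized global load would fill it.
    """
    assert len(load_buf) == k_per_thread * d_per_thread
    store_buf = [None] * len(load_buf)
    for k in range(k_per_thread):
        for d in range(d_per_thread):
            # Element (k, d) sits at k * d_per_thread + d after the load and
            # moves to d * k_per_thread + k so that k becomes contiguous.
            store_buf[d * k_per_thread + k] = load_buf[k * d_per_thread + d]
    return store_buf


# A 2 x 4 tile per thread, labelled by its logical coordinates.
k_per_thread, d_per_thread = 2, 4
load_buf = [f"k{k}d{d}" for k in range(k_per_thread) for d in range(d_per_thread)]
store_buf = repack_for_store(load_buf, k_per_thread, d_per_thread)
print(store_buf)
```

In this model, a store vectorized along `k` wants `k0d0` and `k1d0` in adjacent register slots. A view would only relabel the indices and leave the physical order as the load produced it, so the sketch assumes that a physical reorder is what makes those elements adjacent.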
-
I was reading through the code in GridwiseGemmToBlockwise, around the GridwiseGemm lowering,
and stumbled upon packLoadBufferToStoreBuffer, where we actually lay the data out differently in registers.
Why do we do that instead of taking a view of the load buffer (i.e. a rock.transform of loadBuffer)? As far as I can see, it is just an element reorder.
cc : @krzysz00 @giuseros @jerryyin