I really like this work. However, I'm having some difficulty understanding the shared memory layouts described in Appendix C.
Why does the Naive Swizzled Layout lack hardware support for HGMMA and UTMA instructions, while the 32-byte and 64-byte swizzling configurations do support them? What memory layout do warp-group instructions require?
Here is my understanding:
The ThunderKittens memory format abstraction is designed to support both row-major and column-major 16x16 tiles, so programmers do not have to think about the specific layouts of matrices A or B.
Shared memory comprises 32 banks, each of which delivers 32 bits per cycle. If accessing a row or column of a 16x16 tile touches more than two 16-bit elements in the same bank, a bank conflict results.
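To make my bank-conflict reasoning concrete, here is a minimal sketch in Python. It assumes a plain row-major 64x32 tile of 16-bit elements with no swizzling and a simple `word % 32` bank mapping; the helper name `conflict_degree` is my own and is not taken from ThunderKittens.

```python
from collections import defaultdict

NUM_BANKS = 32   # shared memory banks
BANK_BYTES = 4   # each bank serves one 32-bit word per cycle
ELEM_BYTES = 2   # 16-bit elements


def conflict_degree(coords, row_stride_elems):
    """Return the largest number of distinct 32-bit words that land in one bank.

    Assumes an unswizzled row-major layout (hypothetical helper for illustration).
    """
    words_per_bank = defaultdict(set)
    for row, col in coords:
        byte_addr = (row * row_stride_elems + col) * ELEM_BYTES
        word = byte_addr // BANK_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


# One row of a 16x16 tile vs. one column of it, inside a 32-column-wide tile.
row_access = [(0, c) for c in range(16)]
col_access = [(r, 0) for r in range(16)]

print("row access conflict degree:", conflict_degree(row_access, 32))
print("column access conflict degree:", conflict_degree(col_access, 32))
```

Under these assumptions the row access is conflict-free, while the column access piles eight distinct words into a single bank. As far as I understand, this is the problem swizzling is meant to address.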
The 64x32 16-bit example in Appendix C corresponds to the maximum data required by a warp group, which includes:
4 warps, 1 Tensor Core per warp
2 16x16 input matrices per Tensor Core
8 total 16x16 input matrices -> one 64x32 16-bit tile
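The arithmetic behind that count can be checked with a short sketch. The constants simply restate my assumptions above (4 warps, 2 input tiles per warp); whether this is the actual reason Appendix C uses a 64x32 example is part of what I am asking.

```python
WARPS_PER_WARPGROUP = 4   # assumption: one Tensor Core per warp
TILES_PER_WARP = 2        # assumption: two 16x16 input matrices per Tensor Core
TILE_ROWS = TILE_COLS = 16
ELEM_BYTES = 2            # 16-bit elements

total_tiles = WARPS_PER_WARPGROUP * TILES_PER_WARP
total_elems = total_tiles * TILE_ROWS * TILE_COLS
total_bytes = total_elems * ELEM_BYTES

# The element count matches a 64x32 tile.
assert total_elems == 64 * 32

print(total_tiles, "tiles,", total_elems, "elements,", total_bytes, "bytes")
```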