Questions about warp group memory layout #77

Devil-SX · 2025-01-01T04:43:46Z

I really like this work. However, I'm having some difficulty understanding the shared memory layouts described in Appendix C.

Why does the Naive Swizzled Layout lack hardware support for HGMMA and UTMA instructions, while the 32-byte and 64-byte swizzling configurations are supported? What memory layout do warp group instructions require?

Here is my understanding:

The ThunderKittens memory format abstraction is designed to support both row-major and column-major 16x16 tiles, so programmers do not have to focus on the specific layouts of matrices A or B.
Shared memory comprises 32 banks, each capable of providing 32 bits of bandwidth per cycle. If a row or column within a 16x16 tile accesses more than two 16-bit elements in the same bank, it leads to a bank conflict.
The 64x32 16-bit example in Appendix C corresponds to the maximum data required for a warp group, which includes:
- 4 warps, 1 Tensor Core per warp
- 2 16x16 input matrices per Tensor Core
- 8 total 16x16 input matrices -> one 64x32 16-bit tile
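
The bank arithmetic in the points above can be sketched in a few lines of Python. This is only a simplified model, assuming a plain row-major 64x32 tile of 16-bit elements that starts at shared-memory byte offset 0. It does not reproduce the layouts in Appendix C or the way real instructions split their accesses into phases. The helper names `bank_of` and `conflict_degree` are hypothetical and exist only for this illustration.

```python
NUM_BANKS = 32          # shared memory banks, as stated above
BANK_WIDTH_BYTES = 4    # each bank serves one 32-bit word per cycle
ELEM_BYTES = 2          # 16-bit elements


def bank_of(row, col, cols):
    """Bank holding element (row, col) of a row-major tile with `cols` columns."""
    byte_addr = (row * cols + col) * ELEM_BYTES
    return (byte_addr // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_degree(coords, cols):
    """Largest number of distinct 32-bit words that map to a single bank.

    Two 16-bit elements in the same 32-bit word do not conflict, so the
    model counts distinct words per bank rather than elements per bank.
    """
    words_per_bank = {}
    for row, col in coords:
        word = (row * cols + col) * ELEM_BYTES // BANK_WIDTH_BYTES
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


ROWS, COLS = 64, 32

# One row and one column of the top-left 16x16 sub-tile.
row_access = [(0, c) for c in range(16)]
col_access = [(r, 0) for r in range(16)]

row_degree = conflict_degree(row_access, COLS)
col_degree = conflict_degree(col_access, COLS)

# Tile count from the list above: 4 warps x 2 input tiles of 16x16 each.
total_elements = 4 * 2 * 16 * 16

print("row access conflict degree:", row_degree)
print("column access conflict degree:", col_degree)
print("elements in 8 tiles:", total_elements, "vs 64x32:", ROWS * COLS)
```

In this model each row of the tile spans 64 bytes, or 16 banks, so consecutive rows of one column alternate between only two banks. Walking down a column therefore lands on many distinct words in the same bank, while walking along a row does not. That column-wise pile-up is the kind of conflict a swizzled layout is meant to spread out.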
