forked from NVIDIA/cutlass
gemm_help_menu_profiler.txt
$ tools/profiler/cutlass_profiler --operation=Gemm --help
EA: Also recall the suggested profiler command:
  $ ./tools/profiler/cutlass_profiler \
      --operation=gemm \
      --n=384 --m=106496 --k=16384 \
      --A=bf16:row --B=bf16:column --C=bf16:column \
      --output=cutlass_profile_ffn1_384_transposed.csv
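
As a rough aid for reading the results of the command in the note above, the sketch below works out the size of that problem (m=106496, n=384, k=16384, all operands bf16). It assumes the conventional 2*m*n*k FLOP count for a GEMM and 2 bytes per bf16 element; the profiler's own accounting may differ slightly (for example by including epilogue work), and the helper names `gemm_flops` and `operand_bytes` are illustrative only, not part of CUTLASS.

```python
# Back-of-the-envelope sizing for the GEMM in the note above.
# Assumptions: conventional 2*m*n*k FLOP count; bf16 occupies 2 bytes.

BYTES_PER_BF16 = 2


def gemm_flops(m, n, k):
    """Multiply-add count for an m x n x k GEMM, counted as 2 FLOPs each."""
    return 2 * m * n * k


def operand_bytes(m, n, k, elem_bytes=BYTES_PER_BF16):
    """Storage in bytes for A (m x k), B (k x n) and C (m x n)."""
    return (m * k * elem_bytes, k * n * elem_bytes, m * n * elem_bytes)


m, n, k = 106496, 384, 16384
flops = gemm_flops(m, n, k)
a_bytes, b_bytes, c_bytes = operand_bytes(m, n, k)

print(f"FLOPs per GEMM: {flops:.3e}")
print(f"A: {a_bytes / 2**30:.2f} GiB, "
      f"B: {b_bytes / 2**20:.2f} MiB, "
      f"C: {c_bytes / 2**20:.2f} MiB")
```

Dividing the FLOP count (about 1.34e12) by a measured runtime gives an achieved-throughput estimate that can be compared against the figures in the CSV output.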
GEMM

  [enum]    --gemm_kind                             Variant of GEMM (universal, gemm, planar_complex, planar_complex_array)
  [int]     --m,--problem-size::m                   M dimension of the GEMM problem space
  [int]     --n,--problem-size::n                   N dimension of the GEMM problem space
  [int]     --k,--problem-size::k                   K dimension of the GEMM problem space
  [tensor]  --A                                     Tensor storing the A operand
  [tensor]  --B                                     Tensor storing the B operand
  [tensor]  --C                                     Tensor storing the C operand
  [tensor]  --D                                     Tensor storing the D output
  [scalar]  --alpha,--epilogue::alpha               Epilogue scalar alpha
  [scalar]  --beta,--epilogue::beta                 Epilogue scalar beta
  [enum]    --split_k_mode,--split-k-mode           Variant of split K mode(serial, parallel)
  [int]     --split_k_slices,--split-k-slices       Number of partitions of K dimension
  [int]     --batch_count,--batch-count             Number of GEMMs computed in one batch
  [enum]    --raster_order,--raster-order           Raster order (heuristic, along_n, along_m)
  [int]     --swizzle_size,--swizzle-size           Size to swizzle
  [enum]    --op_class,--opcode-class               Class of math instruction (simt, tensorop, wmmatensorop, wmma)
  [enum]    --accum,--accumulator-type              Math instruction accumulator data type
  [int]     --cta_m,--threadblock-shape::m          Threadblock shape in the M dimension
  [int]     --cta_n,--threadblock-shape::n          Threadblock shape in the N dimension
  [int]     --cta_k,--threadblock-shape::k          Threadblock shape in the K dimension
  [int]     --cluster_m,--cluster-shape::m          Cluster shape in the M dimension
  [int]     --cluster_n,--cluster-shape::n          Cluster shape in the N dimension
  [int]     --cluster_k,--cluster-shape::k          Cluster shape in the K dimension
  [int]     --stages,--threadblock-stages           Number of stages of threadblock-scoped matrix multiply
  [int]     --warps_m,--warp-count::m               Number of warps within threadblock along the M dimension
  [int]     --warps_n,--warp-count::n               Number of warps within threadblock along the N dimension
  [int]     --warps_k,--warp-count::k               Number of warps within threadblock along the K dimension
  [int]     --inst_m,--instruction-shape::m         Math instruction shape in the M dimension
  [int]     --inst_n,--instruction-shape::n         Math instruction shape in the N dimension
  [int]     --inst_k,--instruction-shape::k         Math instruction shape in the K dimension
  [int]     --min_cc,--minimum-compute-capability   Minimum device compute capability
  [int]     --max_cc,--maximum-compute-capability   Maximum device compute capability

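
The [tensor] arguments above are given elsewhere in this help text in `element:layout` form (for example `--A=f16:column`, `--B=*:row`), with `column`, `col`, or `n` naming column-major and `row` or `t` naming row-major. The sketch below shows one way such a value could be normalized before building a command line. It is an illustration only: `parse_tensor_spec` and `LAYOUT_ALIASES` are hypothetical names, and this is not how the profiler itself parses its arguments.

```python
# Hypothetical helper: normalize an "element:layout" tensor argument.
# Layout aliases are taken from the examples in this help text.

LAYOUT_ALIASES = {
    "column": "column", "col": "column", "n": "column",
    "row": "row", "t": "row",
}


def parse_tensor_spec(spec):
    """Split 'bf16:row' into ('bf16', 'row'), resolving layout aliases.

    The element part is returned unchanged, so '*' (any datatype)
    passes through as-is.
    """
    element, sep, layout = spec.partition(":")
    if not sep or layout not in LAYOUT_ALIASES:
        raise ValueError(f"expected element:layout, got {spec!r}")
    return element, LAYOUT_ALIASES[layout]


for spec in ("bf16:row", "f16:n", "*:t"):
    print(spec, "->", parse_tensor_spec(spec))
```

Keeping the aliases in one table makes it easy to check, for instance, that `--B=bf16:column` and `--B=bf16:n` describe the same operand.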
Examples:
Profile a particular problem size:
$ cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=128
Schmoo over problem size and beta:
$ cutlass_profiler --operation=Gemm --m=1024:4096:256 --n=1024:4096:256 --k=128:8192:128 --beta=0,1,2.5
Schmoo over accumulator types:
$ cutlass_profiler --operation=Gemm --accumulator-type=f16,f32
Run when A is f16 with column-major and B is any datatype with row-major (For column major, use column, col, or n. For row major use, row or t):
$ cutlass_profiler --operation=Gemm --A=f16:column --B=*:row
Profile a particular problem size with split K and parallel reduction:
$ cutlass_profiler --operation=Gemm --split_k_mode=parallel --split_k_slices=2 --m=1024 --n=1024 --k=128
Using various input value distribution:
$ cutlass_profiler --operation=Gemm --dist=uniform,min:0,max:3
$ cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
$ cutlass_profiler --operation=Gemm --dist=sequential,start:0,delta:1
Run a kernel with cta tile size of 256x128x32 and save workspace if results are incorrect (note that --cta-tile::k=32 is default cta-tile size):
$ cutlass_profiler --operation=Gemm --cta_m=256 --cta_n=128 --cta_k=32 --save-workspace=incorrect
Test your changes to gemm kernels with a quick functional test and save results in functional-test.csv:
$ cutlass_profiler --operation=Gemm \
--m=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
--n=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
--k=8,16,32,64,128,256,288,384,504,512,520 \
--beta=0,1,2 --profiling-iterations=1 \
--providers=cutlass --output=functional-test.csv
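
Sweeps like the ones above multiply quickly, so it can help to estimate how many problem configurations a command will cover before running it. The sketch below expands comma-separated lists and `start:end:increment` ranges and multiplies the counts. It assumes the range end is inclusive, which should be checked against the profiler documentation; `expand` and `sweep_size` are hypothetical helpers, and they handle only numeric arguments (not `--dist`, whose commas and colons mean something else).

```python
# Hypothetical helpers for estimating the size of a profiler sweep.
# Assumption: 'start:end:increment' ranges include the end value.
from math import prod


def expand(spec):
    """Expand '8,16,32' lists and '1024:4096:256' ranges into numbers."""
    values = []
    for item in spec.split(","):
        if ":" in item:
            start, end, step = (int(x) for x in item.split(":"))
            values.extend(range(start, end + 1, step))  # inclusive end (assumed)
        else:
            values.append(float(item) if "." in item else int(item))
    return values


def sweep_size(**args):
    """Number of argument combinations, i.e. the product of value counts."""
    return prod(len(expand(spec)) for spec in args.values())


# The problem-size-and-beta sweep from the examples.
print(sweep_size(m="1024:4096:256", n="1024:4096:256",
                 k="128:8192:128", beta="0,1,2.5"))

# The quick functional test from the examples.
sizes = "8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384"
print(sweep_size(m=sizes, n=sizes,
                 k="8,16,32,64,128,256,288,384,504,512,520", beta="0,1,2"))
```

The result counts problem configurations only; each one is run for every kernel that matches the other filters, so the total number of profiled cases is larger.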