Additional/Reworked Codegen Passes #889

Merged 1 commit into daphne-eu:main on Nov 18, 2024

Conversation

@AlexRTer (Collaborator) commented on Oct 30, 2024

This PR substantially reworks codegen for AllAgg* and EwOps and adds lowering for TransposeOp and Row/ColAgg*.
All of these passes are part of the optional MLIR codegen pipeline, which can be enabled via the --mlir-codegen flag; they offer an alternative lowering of these operations to MLIR instead of calls to precompiled C++ kernels. Currently, they only support DenseMatrix with dimensions that are known at compile-time, and any value type except Booleans.

Except for IdxMin and IdxMax, which are lowered directly to affine loops, and TransposeOp, which lowers to a named linalg op, all passes use linalg GenericOps, which are then lowered to affine loops in a later pass of the codegen pipeline.
They convert the input DenseMatrix to a MemRef and allocate a new MemRef for the output, which is then converted back into a DenseMatrix.
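
The direct affine-loop lowering for IdxMin/IdxMax boils down to a simple per-row argmin/argmax loop nest. A rough plain-Python sketch of what the row-wise IdxMin computes (illustration only; the actual pass emits MLIR affine loops over MemRefs, and the function name here is hypothetical):

```python
# Sketch of the loop nest a row-wise idxmin lowering computes.
# Plain Python over nested lists, for illustration only.
def row_idx_min(x):
    """For each row, return the column index of its minimum value."""
    result = []
    for row in x:
        best_idx = 0
        for j in range(1, len(row)):
            if row[j] < row[best_idx]:
                best_idx = j
        result.append([best_idx])  # column vector, like an (n x 1) result matrix
    return result

print(row_idx_min([[3, 1, 2], [0, 5, -4]]))  # [[1], [2]]
```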

Changes:

  • Added codegen for AllAgg*Op, Row/ColAgg*Op, Ew*Op, and TransposeOp (see below for details)
    • Added the passes to the TableGen files and the codegen pipeline
  • Added script-level test cases and MLIR test cases (using FileCheck)
    • Replaced old tests
    • Renamed some old test scripts for EwOps for better organization
    • Edited the fusion.mlir test to lower Linalg to affine loops before applying the fusion pass
  • Added canonicalization passes for floor, ceil, and round that remove the respective ops when the input type is an integer (this also simplifies codegen)
  • Added some necessary instantiations in kernels.json
  • Restored alphabetic sorting of the codegen passes in ir/daphneir/Passes.h
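
The floor/ceil/round canonicalization relies on these ops being the identity on integer-typed inputs, so the op can simply be erased. A minimal Python sketch of that rewrite logic (hypothetical, simplified representation; the actual patterns live in the canonicalization infrastructure, not in Python):

```python
# Sketch of the canonicalization idea: floor/ceil/round are identities
# on integer inputs, so the op can be removed and replaced by its operand.
import math

def canonicalize_rounding(op_name, operand, operand_is_integer):
    """Return the operand unchanged for integer inputs, else apply the op."""
    if operand_is_integer:
        return operand  # erase the op; its result equals its input
    return {"floor": math.floor, "ceil": math.ceil, "round": round}[op_name](operand)

print(canonicalize_rounding("floor", 7, True))     # 7 (op removed)
print(canonicalize_rounding("floor", 7.5, False))  # 7
```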

Ops with new codegen:

  • AllAgg*Op
    • Sum, Min, Max
  • Row/ColAgg*Op
    • Sum, Min, Max, IdxMin, IdxMax
  • Ew*Op
    • Unary (scalar/matrix): Abs, Sqrt, Exp, Ln, Sin, Cos, Floor, Ceil, Round
    • Binary (scalar-scalar/matrix-matrix/matrix-scalar broadcasting): Add, Sub, Mul, Div, Pow, Max, Min
  • TransposeOp
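
The three Ew*Op binary cases listed above (scalar-scalar, matrix-matrix, and matrix-scalar broadcasting) can be sketched in plain Python for addition (illustration only; the actual lowering emits linalg GenericOps, and this helper name is hypothetical):

```python
# Sketch of the elementwise binary cases the Ew*Op lowering covers:
# scalar-scalar, matrix-matrix (matching shapes), matrix-scalar broadcast.
def ew_add(lhs, rhs):
    lhs_is_mat = isinstance(lhs, list)
    rhs_is_mat = isinstance(rhs, list)
    if lhs_is_mat and rhs_is_mat:  # matrix-matrix, shapes must match
        return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lhs, rhs)]
    if lhs_is_mat:                 # matrix-scalar: broadcast the scalar
        return [[a + rhs for a in row] for row in lhs]
    return lhs + rhs               # scalar-scalar

print(ew_add([[1, 2], [3, 4]], 10))  # [[11, 12], [13, 14]]
```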

A small example of a lowered kernel:

// ./bin/daphne --mlir-codegen *.daphne
X = [1, 2, 3, 4, 5, 6](2, 3);
print(sum(X, 0));               // sumRow

The input is converted to a MemRef and a result MemRef is allocated. The first Linalg GenericOp initializes the result MemRef by copying the first column of the input, and the second GenericOp iterates over the remaining values and applies the aggregation operation, an addition in this case.
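
The two-phase structure (copy the first column, then fold the remaining columns into it) can be sketched in plain Python (illustration only; the actual generated code is the linalg-based MLIR shown in the IR dump):

```python
# Plain-Python sketch of the two-phase rowSum lowering:
# phase 1 copies column 0 into the result, phase 2 folds the remaining
# columns into it with the aggregation function (here: addition).
def row_sum(x):
    rows, cols = len(x), len(x[0])
    result = [x[i][0] for i in range(rows)]  # phase 1: copy column 0
    for j in range(1, cols):                 # phase 2: reduce columns 1..cols-1
        for i in range(rows):
            result[i] = result[i] + x[i][j]
    return [[v] for v in result]             # shape (rows x 1), as in the IR

print(row_sum([[1, 2, 3], [4, 5, 6]]))  # [[6], [15]]
```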

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, 0)>
...
    %7 = "daphne.convertDenseMatrixToMemRef"(%6) : (!daphne.Matrix<2x3xsi64>) -> memref<2x3xsi64>
    %alloc = memref.alloc() : memref<2x1xsi64>
    %intptr = memref.extract_aligned_pointer_as_index %alloc : memref<2x1xsi64> -> index
    %8 = "daphne.convertMemRefToDenseMatrix"(%intptr, %c0, %c2, %c1, %c1, %c1) : (index, index, index, index, index, index) -> !daphne.Matrix<2x1xsi64>
    
%subview = memref.subview %7[0, 0] [2, 1] [1, 1] : memref<2x3xsi64> to memref<2x1xsi64, strided<[3, 1]>>
    linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%subview : memref<2x1xsi64, strided<[3, 1]>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      linalg.yield %in : si64
    }
    
%subview_0 = memref.subview %7[0, 1] [2, 2] [1, 1] : memref<2x3xsi64> to memref<2x2xsi64, strided<[3, 1], offset: 1>>
    linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%subview_0 : memref<2x2xsi64, strided<[3, 1], offset: 1>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      %9 = linalg.index 0 : index
      %10 = memref.load %alloc[%9, %c0] : memref<2x1xsi64>
      %11 = builtin.unrealized_conversion_cast %in : si64 to i64
      %12 = builtin.unrealized_conversion_cast %10 : si64 to i64
      %13 = arith.addi %12, %11 : i64
      %14 = builtin.unrealized_conversion_cast %13 : i64 to si64
      linalg.yield %14 : si64
    }
...

Known Limitations:

  • Moving the LoopFusionPass below the LinalgToAffineLoopsPass already enables some loop fusions, but it seems to cause issues with, e.g., TransposeOp. A simple example: X = [1,2,3](1,); print(t(X)); print(t(t(X)));. Hence, loop fusion has not been moved down yet.
  • Ew*Op broadcasting for singleton matrices currently has no canonicalizer pass that always moves the singleton matrix to the rhs operand. This should be handled separately, though, to also take broadcasting for the C++ kernels into account (see Matrix broadcasting not working for 1x1 ∘ 1xn / nx1 matrices #803).
  • Dimensions of codegen ops currently need to be known at compile-time. This is due to the way the MemRefType is currently handled during conversion of the input DenseMatrix to a MemRef.
  • RewriteToCallKernelOpPass currently fails if the IR contains math.ipowi or any trigonometric math op other than sin and cos (e.g. no kernels registered for operation 'ipowi'). Hence, the ewBinaryPow test currently fails; before merging, this should be fixed or commented out. The same issue affects the currently commented-out lowering for the trigonometric math ops tan, asin, acos, atan, sinh, cosh, tanh in EwOpsLowering.cpp.

Co-authored-by: philipportner [email protected]
@philipportner (Collaborator) left a comment:
@AlexRTer thanks again for the PR.

Overall, this is very useful and improves our codegen quite a bit. The code also lgtm; I only squashed the commits and made a few smaller changes, like moving a duplicated macro from the files into a function in LoweringUtils.h.

@philipportner merged commit 576bde3 into daphne-eu:main on Nov 18, 2024
3 checks passed