Additional/Reworked Codegen Passes #889

Merged 1 commit into daphne-eu:main on Nov 18, 2024

Conversation

@AlexRTer (Collaborator) commented on Oct 30, 2024

This PR substantially reworks codegen for AllAgg* and EwOps and adds lowering for TransposeOp and Row/ColAgg*.
All of these passes are part of the optional MLIR codegen pipeline, which can be enabled via the --mlir-codegen flag; they offer an alternative lowering of these operations to MLIR instead of calls to precompiled C++ kernels. Currently, they only support DenseMatrix with dimensions that are known at compile-time, and any value type except Booleans.

Except for IdxMin and IdxMax, which are lowered directly to affine loops, and TransposeOp, which lowers to a named linalg op, all passes use linalg GenericOps, which are then lowered to affine loops in a later pass of the codegen pipeline.
They convert the input DenseMatrix to a MemRef and allocate a new MemRef for the output, which is then converted back into a DenseMatrix.
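
The direct affine-loop lowering for IdxMin/IdxMax boils down to a simple per-row argmin/argmax loop nest. A rough plain-Python sketch of what the row-wise IdxMin computes (illustration only; the actual pass emits MLIR affine loops over MemRefs, and the function name here is hypothetical):

```python
# Sketch of the loop nest a row-wise idxmin lowering computes.
# Plain Python over nested lists, for illustration only.
def row_idx_min(x):
    """For each row, return the column index of its minimum value."""
    result = []
    for row in x:
        best_idx = 0
        for j in range(1, len(row)):
            if row[j] < row[best_idx]:
                best_idx = j
        result.append([best_idx])  # column vector, like an (n x 1) result matrix
    return result

print(row_idx_min([[3, 1, 2], [0, 5, -4]]))  # [[1], [2]]
```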

Changes:

  • Added codegen for AllAgg*Op, Row/ColAgg*Op, Ew*Op, and TransposeOp (see below for details)
    • Added the passes to the TableGen files and the codegen pipeline
  • Added script-level test cases and MLIR test cases (using FileCheck)
    • Replaced old tests
    • Renamed some old test scripts for EwOps for better organization
    • Edited the fusion.mlir test to lower Linalg to affine loops before applying the fusion pass
  • Added canonicalization passes for floor, ceil, and round that remove the respective ops when the input type is an integer (this also simplifies codegen)
  • Added some necessary instantiations in kernels.json
  • Restored alphabetic sorting of the codegen passes in ir/daphneir/Passes.h
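
The floor/ceil/round canonicalization relies on these ops being the identity on integer-typed inputs, so the op can simply be erased. A minimal Python sketch of that rewrite logic (hypothetical, simplified representation; the actual patterns live in the canonicalization infrastructure, not in Python):

```python
# Sketch of the canonicalization idea: floor/ceil/round are identities
# on integer inputs, so the op can be removed and replaced by its operand.
import math

def canonicalize_rounding(op_name, operand, operand_is_integer):
    """Return the operand unchanged for integer inputs, else apply the op."""
    if operand_is_integer:
        return operand  # erase the op; its result equals its input
    return {"floor": math.floor, "ceil": math.ceil, "round": round}[op_name](operand)

print(canonicalize_rounding("floor", 7, True))     # 7 (op removed)
print(canonicalize_rounding("floor", 7.5, False))  # 7
```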

Ops with new codegen:

  • AllAgg*Op
    • Sum, Min, Max
  • Row/ColAgg*Op
    • Sum, Min, Max, IdxMin, IdxMax
  • Ew*Op
    • Unary (scalar/matrix): Abs, Sqrt, Exp, Ln, Sin, Cos, Floor, Ceil, Round
    • Binary (scalar-scalar/matrix-matrix/matrix-scalar broadcasting): Add, Sub, Mul, Div, Pow, Max, Min
  • TransposeOp
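
The three Ew*Op binary cases listed above (scalar-scalar, matrix-matrix, and matrix-scalar broadcasting) can be sketched in plain Python for addition (illustration only; the actual lowering emits linalg GenericOps, and this helper name is hypothetical):

```python
# Sketch of the elementwise binary cases the Ew*Op lowering covers:
# scalar-scalar, matrix-matrix (matching shapes), matrix-scalar broadcast.
def ew_add(lhs, rhs):
    lhs_is_mat = isinstance(lhs, list)
    rhs_is_mat = isinstance(rhs, list)
    if lhs_is_mat and rhs_is_mat:  # matrix-matrix, shapes must match
        return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lhs, rhs)]
    if lhs_is_mat:                 # matrix-scalar: broadcast the scalar
        return [[a + rhs for a in row] for row in lhs]
    return lhs + rhs               # scalar-scalar

print(ew_add([[1, 2], [3, 4]], 10))  # [[11, 12], [13, 14]]
```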

A small example of a lowered kernel:

// ./bin/daphne --mlir-codegen *.daphne
X = [1, 2, 3, 4, 5, 6](2, 3);
print(sum(X, 0));               // sumRow

The input is converted to a MemRef and a result MemRef is allocated. The first Linalg GenericOp initializes the result MemRef by copying the first column of the input, and the second GenericOp iterates over the remaining values and applies the aggregation operation, an addition in this case.
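
The two-phase structure (copy the first column, then fold the remaining columns into it) can be sketched in plain Python (illustration only; the actual generated code is the linalg-based MLIR shown in the IR dump):

```python
# Plain-Python sketch of the two-phase rowSum lowering:
# phase 1 copies column 0 into the result, phase 2 folds the remaining
# columns into it with the aggregation function (here: addition).
def row_sum(x):
    rows, cols = len(x), len(x[0])
    result = [x[i][0] for i in range(rows)]  # phase 1: copy column 0
    for j in range(1, cols):                 # phase 2: reduce columns 1..cols-1
        for i in range(rows):
            result[i] = result[i] + x[i][j]
    return [[v] for v in result]             # shape (rows x 1), as in the IR

print(row_sum([[1, 2, 3], [4, 5, 6]]))  # [[6], [15]]
```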

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, 0)>
...
    %7 = "daphne.convertDenseMatrixToMemRef"(%6) : (!daphne.Matrix<2x3xsi64>) -> memref<2x3xsi64>
    %alloc = memref.alloc() : memref<2x1xsi64>
    %intptr = memref.extract_aligned_pointer_as_index %alloc : memref<2x1xsi64> -> index
    %8 = "daphne.convertMemRefToDenseMatrix"(%intptr, %c0, %c2, %c1, %c1, %c1) : (index, index, index, index, index, index) -> !daphne.Matrix<2x1xsi64>
    
%subview = memref.subview %7[0, 0] [2, 1] [1, 1] : memref<2x3xsi64> to memref<2x1xsi64, strided<[3, 1]>>
    linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%subview : memref<2x1xsi64, strided<[3, 1]>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      linalg.yield %in : si64
    }
    
%subview_0 = memref.subview %7[0, 1] [2, 2] [1, 1] : memref<2x3xsi64> to memref<2x2xsi64, strided<[3, 1], offset: 1>>
    linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "reduction"]} ins(%subview_0 : memref<2x2xsi64, strided<[3, 1], offset: 1>>) outs(%alloc : memref<2x1xsi64>) {
    ^bb0(%in: si64, %out: si64):
      %9 = linalg.index 0 : index
      %10 = memref.load %alloc[%9, %c0] : memref<2x1xsi64>
      %11 = builtin.unrealized_conversion_cast %in : si64 to i64
      %12 = builtin.unrealized_conversion_cast %10 : si64 to i64
      %13 = arith.addi %12, %11 : i64
      %14 = builtin.unrealized_conversion_cast %13 : i64 to si64
      linalg.yield %14 : si64
    }
...

Known Limitations:

  • Moving the LoopFusionPass below the LinalgToAffineLoopsPass already enables some loop fusions, but it seems to cause issues with, e.g., TransposeOp. A simple example: X = [1,2,3](1,); print(t(X)); print(t(t(X)));. Hence, loop fusion has not been moved down yet.
  • Ew*Op broadcasting for singleton matrices currently has no canonicalizer pass that always moves the singleton matrix to the rhs operand. This should be handled separately, though, to also take broadcasting for the C++ kernels into account (see Matrix broadcasting not working for 1x1 ∘ 1xn / nx1 matrices #803).
  • Dimensions of codegen ops currently need to be known at compile-time. This is due to the way the MemRefType is currently handled during conversion of the input DenseMatrix to a MemRef.
  • RewriteToCallKernelOpPass currently fails if the IR contains math.ipowi or any trigonometric math op other than sin and cos (e.g. no kernels registered for operation 'ipowi'). Hence, the ewBinaryPow test currently fails; before merging, this should be fixed or commented out. The same issue affects the currently commented-out lowering for the trigonometric math ops tan, asin, acos, atan, sinh, cosh, tanh in EwOpsLowering.cpp.

Co-authored-by: philipportner [email protected]
@philipportner (Collaborator) left a comment:
@AlexRTer thanks again for the PR.

Overall, this is very useful and improves our codegen quite a bit. The code also lgtm; I only squashed the commits and made a few smaller changes, like moving a duplicated macro from the files into a function in LoweringUtils.h.

@philipportner merged commit 576bde3 into daphne-eu:main on Nov 18, 2024
3 checks passed