[mlir-tensorrt] Add support for non-DPS calling convention #258

Open

jhalakpatel wants to merge 2 commits into main from jhalakp-alloc-enqueue

Conversation

@jhalakpatel (Collaborator) commented on Oct 8, 2024

This commit introduces support for a non-Destination-Passing Style (non-DPS) calling convention in mlir-tensorrt, while maintaining the existing DPS-style interface.

The changes allow users to compile and execute an mlir-tensorrt executable without allocating output memrefs in advance. Removing that restriction relieves users of having to compute output shapes and allocate output memrefs ahead of time, which matters because both steps otherwise sit in the performance-critical execution loop.

Deferred output allocation has the added advantage that we no longer need to allocate output buffers sized to shape upper bounds and then copy the exact results from TensorRT into the mlir-tensorrt output buffers.

This approach also allows us to support data-dependent shapes since the outputs are not required to be allocated before execution.

The non-DPS calling convention is implemented on top of the nvinfer1::IOutputAllocator interface for deferred output allocation. Its methods, such as reallocateOutputAsync and notifyShape, record the allocated output buffer address and its shape.
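
For context, a minimal sketch of such an allocator is shown below. It is not this PR's implementation: the class and member names are illustrative, and it assumes a TensorRT version that exposes reallocateOutputAsync plus CUDA 11.2+ stream-ordered allocation.

    #include <NvInferRuntime.h>
    #include <cuda_runtime_api.h>

    class SketchOutputAllocator : public nvinfer1::IOutputAllocator {
    public:
      // TensorRT calls this when the output tensor needs (re)allocation.
      void *reallocateOutputAsync(char const * /*tensorName*/, void * /*currentMemory*/,
                                  uint64_t size, uint64_t /*alignment*/,
                                  cudaStream_t stream) noexcept override {
        if (size > capacity) {
          if (outputPtr)
            cudaFreeAsync(outputPtr, stream);
          if (cudaMallocAsync(&outputPtr, size, stream) != cudaSuccess) {
            outputPtr = nullptr;
            capacity = 0;
            return nullptr;
          }
          capacity = size;
        }
        return outputPtr; // Recorded so the runtime can wrap it in a result buffer.
      }

      // TensorRT calls this once the exact output shape is known.
      void notifyShape(char const * /*tensorName*/,
                       nvinfer1::Dims const &dims) noexcept override {
        outputDims = dims;
      }

      void *outputPtr{nullptr};
      uint64_t capacity{0};
      nvinfer1::Dims outputDims{};
    };

The runtime would register one such allocator per result tensor via IExecutionContext::setOutputAllocator before enqueueing, then read back outputPtr and outputDims after the stream is synchronized.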

User call sites with the existing DPS-style calling convention:

        out_args = ...  # Requires output shape calculation and output buffer allocation before execution.
        session.execute_function(
            "main", in_args=in_args, out_args=out_args, stream=self.stream._active_cuda_stream
        )

With the non-DPS-style convention, we can now do the following:

        outputs = session.execute_function(
            "main", in_args=in_args, stream=self.stream._active_cuda_stream
        )

Key changes include:

  1. Updated the Plan dialect to implement the non-DPS calling convention:
    • Updated PlanAllocTensorsPass and CreateClosedRegionsPass to handle the non-DPS convention.
    • Updated the Plan transformation to convert DPS and non-DPS group ops to CallOp and CallAllocOp, respectively.
  2. Enhanced TensorRT runtime support:
    • Modified ConvertTensorRTToTensorRTRuntime to support both calling conventions.
    • Added EnqueueAllocOp for non-DPS-style execution.
    • Added a conversion from EnqueueAllocOp to the executor CallOp.
    • Implemented OutputAllocator and CustomTensorRTOutputAllocator classes with proper lifetime management via OutputAllocatorTracker (see the sketch after this list).
  3. Updated API and runtime interfaces:
    • Modified executeFunctionWithLuaBackend to support both DPS and non-DPS styles.
    • Updated the Python bindings to allow the non-DPS style.
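
As a purely hypothetical illustration of the lifetime management mentioned above (the tracker's name and API are assumptions, and SketchOutputAllocator is the class sketched earlier), the tracker only needs to own the per-result allocators until the enqueued work completes:

    #include <memory>
    #include <utility>
    #include <vector>

    // Owns every per-result output allocator so each one outlives the enqueued
    // work that writes into its buffer.
    class SketchOutputAllocatorTracker {
    public:
      SketchOutputAllocator *track(std::unique_ptr<SketchOutputAllocator> alloc) {
        allocators.push_back(std::move(alloc));
        return allocators.back().get();
      }

    private:
      std::vector<std::unique_ptr<SketchOutputAllocator>> allocators;
    };

    // Hypothetical usage around an enqueue (context is an nvinfer1::IExecutionContext*):
    //   SketchOutputAllocatorTracker tracker;
    //   for (char const *name : outputTensorNames) {
    //     auto *alloc = tracker.track(std::make_unique<SketchOutputAllocator>());
    //     context->setOutputAllocator(name, alloc);
    //   }
    //   context->enqueueV3(stream);
    //   // After synchronizing, each allocator's outputPtr/outputDims describe one result.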

@jhalakpatel marked this pull request as a draft on October 8, 2024 17:00
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 6 times, most recently from dbe281b to 0a8dc8e on October 9, 2024 18:24
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 4 times, most recently from a669203 to 36c143c on October 11, 2024 01:12
if (result.getType().isa<TensorType>() != allTensors) {
return emitOpError("all results must be of the same type (all tensors "
"or all memrefs)");
}
Collaborator

We should also be verifying the layout (stride + offset) information on memref results, e.g.

trtrt.alloc_enqueue .... -> memref<?x?x?xf32>

implies the identity layout (canonical row major strides).

trtrt.alloc_enqueue .... -> memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>

indicates that the strides are unknown.

Not being able to assume anything about the strides is the worst case, since it disables many possible optimizations.

We need this information from TensorRT -- what layouts of results are possible/allowed? Can we enforce that canonical strides will always be returned from TensorRT using the output allocator? If we can, then the verifier should enforce that canonical layouts are used.

Currently for trtrt.enqueue we are effectively enforcing canonical strides for input and output buffers.

Collaborator (Author)

As I mentioned in the other thread, MLIR-TRT does not use the nvinfer1::ITensor::setAllowedFormats API, which would allow formats other than nvinfer1::TensorFormat::kLINEAR. So, for all practical purposes, the strides here are canonical.
I will add a check in the verifier for canonical strides. Let me know if this is what you meant, as I might have misunderstood you.

Also, how can I generate this assembly format: memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>?
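
A minimal sketch of what such a verifier check could look like, using standard MLIR layout APIs inside the op's verify() method (the exact diagnostic wording is an assumption, not this PR's code):

    // Additionally require the canonical (identity, row-major) layout on memref results.
    for (Value result : getResults()) {
      if (auto memrefType = result.getType().dyn_cast<MemRefType>()) {
        if (!memrefType.getLayout().isIdentity())
          return emitOpError("memref results must use the canonical "
                             "(identity) layout");
      }
    }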

@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 7 times, most recently from 3844712 to dd26988 on October 14, 2024 00:18
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 6 times, most recently from ef2bddf to 8a9ae8c on October 14, 2024 23:38
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 8 times, most recently from b8643e5 to 0c89615 on October 17, 2024 22:10
@jhalakpatel (Collaborator, Author)

This PR will eventually be closed, as it has been broken down into smaller PRs. The non-DPS-style functional changes are open for review in the following, more manageable PRs:

  1. Output allocator in TensorRT Module: [Lua/TensorRTModule] Implement OutputAllocator for TensorRT execution #282
  2. Some improvements to variable names: [NFC] Improve various variable names #283
  3. Adding an option to dump textual pipeline: [Compiler] Split compiler pipeline into pass pipeline, compilation, and translation steps #284
  4. Update Plan dialect to support non-DPS calling convention: [Dialect/Plan] Update Plan dialect to use non-dps calling convention #285
  5. Update TensorRT runtime dialect to use non-dps calling convention: [Dialect/TensorRTRuntime] Update tensorrt runtime dialect to use non-dps calling convention #286
  6. Update TensorRT dialect: [Dialect/TensorRT] Update tensorrt dialect to use non-dps calling convention #287
  7. Update Lua Runtime: [Runtime/Lua] Support multiple results in executeFunctionWithLuaBackend #288

@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 10 times, most recently from cfeefc6 to 9118b9a on October 27, 2024 01:18
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 3 times, most recently from 15a15bf to e9839c1 on November 4, 2024 19:47
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 3 times, most recently from 3478d82 to d8d2f81 on November 4, 2024 20:08