[mlir-tensorrt] Add support for non-DPS calling convention #258

Open

jhalakpatel wants to merge 2 commits into main from jhalakp-alloc-enqueue

Conversation

@jhalakpatel (Collaborator) commented on Oct 8, 2024

This commit introduces support for a non-Destination-Passing Style (non-DPS) calling convention in mlir-tensorrt, while maintaining the existing DPS-style interface.

The changes allow users to compile and execute an mlir-tensorrt executable without allocating output memrefs in advance. Removing that restriction relieves users of having to compute output shapes and allocate output memrefs ahead of time, which matters because both steps otherwise sit in the performance-critical execution loop.

Deferred output allocation has the added advantage that we no longer need to allocate output buffers sized to shape upper bounds and then copy the exact results from TensorRT into the mlir-tensorrt output buffers.

This approach also allows us to support data-dependent shapes since the outputs are not required to be allocated before execution.

The non-DPS calling convention is implemented on top of the nvinfer1::IOutputAllocator interface for deferred output allocation. Its methods, such as reallocateOutputAsync and notifyShape, record the allocated output buffer address and its shape.
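
For context, a minimal sketch of such an allocator is shown below. It is not this PR's implementation: the class and member names are illustrative, and it assumes a TensorRT version that exposes reallocateOutputAsync plus CUDA 11.2+ stream-ordered allocation.

    #include <NvInferRuntime.h>
    #include <cuda_runtime_api.h>

    class SketchOutputAllocator : public nvinfer1::IOutputAllocator {
    public:
      // TensorRT calls this when the output tensor needs (re)allocation.
      void *reallocateOutputAsync(char const * /*tensorName*/, void * /*currentMemory*/,
                                  uint64_t size, uint64_t /*alignment*/,
                                  cudaStream_t stream) noexcept override {
        if (size > capacity) {
          if (outputPtr)
            cudaFreeAsync(outputPtr, stream);
          if (cudaMallocAsync(&outputPtr, size, stream) != cudaSuccess) {
            outputPtr = nullptr;
            capacity = 0;
            return nullptr;
          }
          capacity = size;
        }
        return outputPtr; // Recorded so the runtime can wrap it in a result buffer.
      }

      // TensorRT calls this once the exact output shape is known.
      void notifyShape(char const * /*tensorName*/,
                       nvinfer1::Dims const &dims) noexcept override {
        outputDims = dims;
      }

      void *outputPtr{nullptr};
      uint64_t capacity{0};
      nvinfer1::Dims outputDims{};
    };

The runtime would register one such allocator per result tensor via IExecutionContext::setOutputAllocator before enqueueing, then read back outputPtr and outputDims after the stream is synchronized.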

User call sites with the existing DPS-style calling convention:

        out_args = ...  # Requires output shape calculation and output buffer allocation before execution.
        session.execute_function(
            "main", in_args=in_args, out_args=out_args, stream=self.stream._active_cuda_stream
        )

With the non-DPS-style convention, we can now do the following:

        outputs = session.execute_function(
            "main", in_args=in_args, stream=self.stream._active_cuda_stream
        )

Key changes include:

  1. Updated the Plan dialect to implement the non-DPS calling convention:
    • Updated PlanAllocTensorsPass and CreateClosedRegionsPass to handle the non-DPS convention.
    • Updated the Plan transformation to convert DPS and non-DPS group ops to CallOp and CallAllocOp, respectively.
  2. Enhanced TensorRT runtime support:
    • Modified ConvertTensorRTToTensorRTRuntime to support both calling conventions.
    • Added EnqueueAllocOp for non-DPS-style execution.
    • Added a conversion from EnqueueAllocOp to the executor CallOp.
    • Implemented OutputAllocator and CustomTensorRTOutputAllocator classes with proper lifetime management via OutputAllocatorTracker (see the sketch after this list).
  3. Updated API and runtime interfaces:
    • Modified executeFunctionWithLuaBackend to support both DPS and non-DPS styles.
    • Updated the Python bindings to allow the non-DPS style.
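
As a purely hypothetical illustration of the lifetime management mentioned above (the tracker's name and API are assumptions, and SketchOutputAllocator is the class sketched earlier), the tracker only needs to own the per-result allocators until the enqueued work completes:

    #include <memory>
    #include <utility>
    #include <vector>

    // Owns every per-result output allocator so each one outlives the enqueued
    // work that writes into its buffer.
    class SketchOutputAllocatorTracker {
    public:
      SketchOutputAllocator *track(std::unique_ptr<SketchOutputAllocator> alloc) {
        allocators.push_back(std::move(alloc));
        return allocators.back().get();
      }

    private:
      std::vector<std::unique_ptr<SketchOutputAllocator>> allocators;
    };

    // Hypothetical usage around an enqueue (context is an nvinfer1::IExecutionContext*):
    //   SketchOutputAllocatorTracker tracker;
    //   for (char const *name : outputTensorNames) {
    //     auto *alloc = tracker.track(std::make_unique<SketchOutputAllocator>());
    //     context->setOutputAllocator(name, alloc);
    //   }
    //   context->enqueueV3(stream);
    //   // After synchronizing, each allocator's outputPtr/outputDims describe one result.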

@jhalakpatel marked this pull request as a draft on October 8, 2024 17:00
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 6 times, most recently from dbe281b to 0a8dc8e on October 9, 2024 18:24
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 4 times, most recently from a669203 to 36c143c on October 11, 2024 01:12
if (result.getType().isa<TensorType>() != allTensors) {
return emitOpError("all results must be of the same type (all tensors "
"or all memrefs)");
}
Collaborator

We should also be verifying the layout (stride + offset) information on memref results, e.g.

trtrt.alloc_enqueue .... -> memref<?x?x?xf32>

implies the identity layout (canonical row major strides).

trtrt.alloc_enqueue .... -> memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>

indicates that the strides are unknown.

Not being able to assume anything about the strides is the worst case, since it disables many possible optimizations.

We need this information from TensorRT -- what layouts of results are possible/allowed? Can we enforce that canonical strides will always be returned from TensorRT using the output allocator? If we can, then the verifier should enforce that canonical layouts are used.

Currently for trtrt.enqueue we are effectively enforcing canonical strides for input and output buffers.

Collaborator (Author)

As I mentioned in the other thread, MLIR-TRT does not use the nvinfer1::ITensor::setAllowedFormats API, which would allow formats other than nvinfer1::TensorFormat::kLINEAR. So, for all practical purposes, the strides here are canonical.
I will add a check in the verifier for canonical strides. Let me know if this is what you meant, as I might have misunderstood you.

Also, how can I generate this assembly format: memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>?
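
A minimal sketch of what such a verifier check could look like, using standard MLIR layout APIs inside the op's verify() method (the exact diagnostic wording is an assumption, not this PR's code):

    // Additionally require the canonical (identity, row-major) layout on memref results.
    for (Value result : getResults()) {
      if (auto memrefType = result.getType().dyn_cast<MemRefType>()) {
        if (!memrefType.getLayout().isIdentity())
          return emitOpError("memref results must use the canonical "
                             "(identity) layout");
      }
    }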

@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 7 times, most recently from 3844712 to dd26988 on October 14, 2024 00:18
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 6 times, most recently from ef2bddf to 8a9ae8c on October 14, 2024 23:38
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 8 times, most recently from b8643e5 to 0c89615 on October 17, 2024 22:10
@jhalakpatel (Collaborator, Author)

This PR will eventually be closed, as it has been broken down into smaller PRs. The non-DPS-style functional changes are open for review in the following, more manageable PRs:

  1. Output allocator in TensorRT Module: [Lua/TensorRTModule] Implement OutputAllocator for TensorRT execution #282
  2. Some improvements to variable names: [NFC] Improve various variable names #283
  3. Adding an option to dump textual pipeline: [Compiler] Split compiler pipeline into pass pipeline, compilation, and translation steps #284
  4. Update Plan dialect to support non-DPS calling convention: [Dialect/Plan] Update Plan dialect to use non-dps calling convention #285
  5. Update TensorRT runtime dialect to use non-dps calling convention: [Dialect/TensorRTRuntime] Update tensorrt runtime dialect to use non-dps calling convention #286
  6. Update TensorRT dialect: [Dialect/TensorRT] Update tensorrt dialect to use non-dps calling convention #287
  7. Update Lua Runtime: [Runtime/Lua] Support multiple results in executeFunctionWithLuaBackend #288

@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 10 times, most recently from cfeefc6 to 9118b9a on October 27, 2024 01:18
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 3 times, most recently from 15a15bf to e9839c1 on November 4, 2024 19:47
@jhalakpatel force-pushed the jhalakp-alloc-enqueue branch 3 times, most recently from 3478d82 to d8d2f81 on November 4, 2024 20:08