
async execute is not run concurrently #7888

Open
ShuaiShao93 opened this issue Dec 17, 2024 · 4 comments

Comments

@ShuaiShao93

Description
We have a Python BLS model that calls into another model. This BLS model is just a thin wrapper around await infer_request.async_exec(). In this case, the async execute function should be able to handle multiple requests concurrently while it is waiting on async_exec.

However, we noticed that a backlog builds up on this BLS model rather than on the actual backend model, which means requests are not being processed concurrently.
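For context on that expectation: when the awaited call is truly asynchronous, a single event loop can interleave many execute() calls, so N concurrent requests finish in roughly one backend latency. A standalone sketch (`fake_async_exec` stands in for `infer_request.async_exec()`; nothing here is Triton API):

```python
import asyncio
import time

async def fake_async_exec():
    # Stands in for `await infer_request.async_exec()` against a slow
    # backend model with ~0.2 s latency.
    await asyncio.sleep(0.2)

async def execute(request_id):
    # Thin BLS-style wrapper: all real work happens in the awaited call,
    # so the event loop is free to start other requests at the `await`.
    await fake_async_exec()
    return request_id

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(execute(i) for i in range(8)))
    return time.perf_counter() - start, results

elapsed, results = asyncio.run(main())
# Eight concurrent "requests" complete in roughly one backend latency
# (about 0.2 s), not 8 * 0.2 s, because every execute() yields at the await.
print(elapsed < 0.5, results == list(range(8)))
```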

Triton Information
24.11

To Reproduce

  1. Define a Python model with an async execute function that calls into another model. The Python model is very lightweight, while the backend model is much slower.
  2. Start sending concurrent requests with batch_size=1.
  3. Fetch the metrics and check the queue size of each model.
  4. Notice that the queue size of the BLS model keeps increasing, while the queue size of the backend model stays at 0.
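For step 3, one way to watch the per-model queues is Triton's Prometheus metrics endpoint (port 8002 by default). The snippet below substitutes a hypothetical sample of the curl output so the pipeline is self-contained; the model names and the 42/0 values are made up for illustration, and `nv_inference_pending_request_count` is the pending-count metric name from Triton's metrics documentation:

```shell
# In a live setup:
#   curl -s localhost:8002/metrics | grep nv_inference_pending_request_count
# Here we stand in a hypothetical sample of that output:
metrics='nv_inference_pending_request_count{model="bls_model",version="1"} 42
nv_inference_pending_request_count{model="backend_model",version="1"} 0'

# The BLS model's queue grows while the backend model's queue stays at 0:
printf '%s\n' "$metrics" | grep 'nv_inference_pending_request_count'
```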

Expected behavior
If the async BLS model handled concurrent requests, the backlog would build up on the backend model rather than on the BLS model.

@fighterhit

fighterhit commented Dec 19, 2024

+1. We also encountered this problem in NGC Triton Server 23.12. I suspect that the underlying async_exec is not truly asynchronous and blocks the event loop of the Python backend.
I had to switch to the Triton gRPC aio client to call another TensorFlow model on the same Triton server from the BLS Python backend, which temporarily works around the problem. But each gRPC connection is destroyed when its event loop is closed, so I have to create a new connection on every call to async def execute(). Is there a way to reuse the same gRPC connection in asynchronous mode? Fundamentally, though, the blocking behavior of infer_request.async_exec() should be fixed.
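A possible workaround for the connection-reuse part of this is to cache one client per running event loop, so a client is only reused while its loop is still alive. A sketch, not Triton or tritonclient API: `make_client` is a placeholder zero-argument factory, e.g. `lambda: grpcclient.InferenceServerClient(url="localhost:10502")`:

```python
import asyncio

# One cached client per event loop: a client created under a now-closed
# loop cannot be reused, but within the lifetime of a single loop it can.
_clients = {}

def get_client(make_client):
    """Return a client bound to the current event loop, creating it lazily."""
    loop = asyncio.get_running_loop()
    if loop not in _clients:
        _clients[loop] = make_client()
    return _clients[loop]

async def demo():
    # Two lookups inside the same loop return the same cached object.
    a = get_client(object)
    b = get_client(object)
    return a is b

print(asyncio.run(demo()))
```

A production version would also need to evict entries whose loops have closed.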

@Tabrizian @okdimok @oandreeva-nv PTAL, thanks!

@oandreeva-nv
Contributor

Thanks for the issue. Is it possible to share some code? This will significantly speed up the debugging process.

@ShuaiShao93
Author

> Thanks for the issue. Is it possible to share some code? This will significantly speed up the debugging process.

I believe you can use this model in non-decoupled mode:

```python
import os

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        is_decoupled = os.environ["BLS_KIND"] == "decoupled"
        responses = []
        for _ in requests:
            if is_decoupled:
                test1 = await multiple_async_bls_square(gpu=True)
                test2 = await multiple_async_bls_square(gpu=False)
                test3 = await async_bls_square()
            else:
                test1 = await multiple_async_bls_addsub(gpu=True)
                test2 = await multiple_async_bls_addsub(gpu=False)
                test3 = await async_bls_add_sub()
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("OUTPUT0", np.array([test1 & test2 & test3]))
                    ]
                )
            )
        return responses
```

Maybe adding some sleep in the backend model would showcase this better.

@fighterhit

fighterhit commented Dec 21, 2024

Hi @oandreeva-nv, when I used InferRequest.async_exec, I profiled the Python backend process with py-spy. Here is the flame graph:

(py-spy flame graph image)

It can be seen that the CPU time is mostly spent at https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L81 and https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/thread.py#L58. Underneath, these correspond to PyThread_acquire_lock_timed (libpython3.10.so.1.0) and pthread_cond_timedwait (libc.so.6).
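Those two frames are the timed wait inside `concurrent.futures` (`Future.result()` waiting on a condition variable). The effect of such a synchronous wait on an event loop can be reproduced with plain stdlib code; this is a standalone sketch of the hypothesis, where `blocking_wrapper` mimics the suspected behavior of `async_exec`, not its actual implementation:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def slow_call():
    # Stands in for a slow backend inference (~0.2 s).
    time.sleep(0.2)

async def blocking_wrapper(pool):
    # An "async" API that secretly blocks: it submits work to a thread
    # pool but then waits on the Future synchronously. This stalls the
    # event loop, and a py-spy profile of it shows Future.result() ->
    # PyThread_acquire_lock_timed / pthread_cond_timedwait.
    pool.submit(slow_call).result()

async def nonblocking_wrapper(pool):
    # Truly asynchronous: the coroutine yields to the loop while waiting.
    await asyncio.get_running_loop().run_in_executor(pool, slow_call)

async def timed(wrapper, n=4):
    with ThreadPoolExecutor(max_workers=n) as pool:
        start = time.perf_counter()
        await asyncio.gather(*(wrapper(pool) for _ in range(n)))
        return time.perf_counter() - start

# Four "requests" through the blocking wrapper run serially (roughly
# n * 0.2 s); through the non-blocking one they overlap (roughly 0.2 s).
print(asyncio.run(timed(blocking_wrapper)) > asyncio.run(timed(nonblocking_wrapper)))
```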

I took a quick look at the async_exec implementation in the Python backend.

I am not sure whether it is CPU-intensive and causes the blocking, but when I replaced it with the aio gRPC client, the blocking disappeared. FYI.

```python
import tritonclient.grpc.aio as grpcclient


class TritonPythonModel:
    async def execute(self, requests):
        # ... some logic

        # infer_request.async_exec()  # Blocking; does not work.

        # Aio gRPC works, but the connection has to be established on every
        # call to avoid the client being unusable once its event loop is closed.
        triton_client = grpcclient.InferenceServerClient(url="localhost:10502")
        results = await triton_client.infer(
            model_name="the model in the same triton server",
            inputs=inputs,
            outputs=outputs,
        )
```
