
[AI] Add support for Object Detection pipeline #3228

Open

RUFFY-369 wants to merge 31 commits into livepeer:master
Conversation


@RUFFY-369 RUFFY-369 commented Nov 1, 2024

What does this pull request do? Explain your changes. (required)

Adds support for the ai-worker object-detection pipeline, which runs the (real-time) RT-DETR model by default.

Corresponding AI-Worker PR: livepeer/ai-worker#243
Specific updates (required)

How did you test each of these updates? (required)

Testing was performed by running the gateway + worker locally.

Does this pull request close any open issues?

Checklist:

cc @rickstaa

@RUFFY-369 RUFFY-369 requested a review from rickstaa as a code owner November 1, 2024 15:19
@github-actions bot added the AI label (Issues and PRs related to the AI-video branch) Nov 1, 2024
@leszko leszko deleted the branch livepeer:master November 7, 2024 08:26
@leszko leszko closed this Nov 7, 2024
@rickstaa rickstaa reopened this Nov 13, 2024
@rickstaa rickstaa changed the base branch from ai-video to master November 13, 2024 21:55
@ad-astra-video
Collaborator

ad-astra-video commented Nov 18, 2024

Did an initial look through and it looks pretty good. Will update tomorrow when I can pull and run it locally. Thank you for adding tests!

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

@leszko
Contributor

leszko commented Nov 18, 2024

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

I don't have enough context to answer this. @rickstaa may know more.

@RUFFY-369
Author

RUFFY-369 commented Nov 18, 2024

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

If @rickstaa comments on the future of CPU transcoding, then I can push a commit adding a labels_only header to the API to avoid calling transcodeFrames.
P.S. If agreed after discussion, the header can be added after E2E testing of the pipeline in its current state.
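
For illustration, a minimal sketch of how such a header could gate the transcode step; the labels_only header name, the handler shape, and the response types here are assumptions for this sketch, not the merged implementation:

```go
package server

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Hypothetical types for illustration only; the real PR's types may differ.
type Detection struct {
	Label      string  `json:"label"`
	Confidence float64 `json:"confidence"`
}

type ObjectDetectionResponse struct {
	Detections []Detection `json:"detections"`
	Video      []byte      `json:"video,omitempty"`
}

// transcodeFrames stands in for the slow CPU transcode step.
func transcodeFrames(frames [][]byte) ([]byte, error) { return nil, nil }

// handleObjectDetection skips the transcode entirely when the caller only
// wants the detections (signalled via the hypothetical "labels_only" header).
func handleObjectDetection(w http.ResponseWriter, r *http.Request, frames [][]byte, dets []Detection) error {
	labelsOnly := strings.EqualFold(r.Header.Get("labels_only"), "true")

	resp := ObjectDetectionResponse{Detections: dets}
	if !labelsOnly {
		video, err := transcodeFrames(frames) // slow CPU path, only when the video is requested
		if err != nil {
			return err
		}
		resp.Video = video
	}

	w.Header().Set("Content-Type", "application/json")
	return json.NewEncoder(w).Encode(resp)
}
```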

@rickstaa
Member

@leszko, @ad-astra-video I haven't planned to replace the CPU transcoding yet; I was considering holding off until it becomes necessary for the real-time version of this pipeline. I think @ad-astra-video's solution makes sense for now.

@ad-astra-video
Collaborator

@RUFFY-369 I have looked through the code and ran it E2E (with the updates below) in Docker. You did a good job getting all the remote worker parts together!

Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:

  1. I updated ai_http.go to use ffmpeg.GetCodecInfoBytes for the outpixels calculation (a rough sketch follows this list). Also updated ai-worker to use the LPMS function because ffprobe is not installed in the Docker container.
  2. Updated core/ai_worker.go to transcode all the frames into one MP4; I think it was transcoding each frame individually into separate MP4s before the update.
    2a. I also updated it to guess an appropriate bitrate using the ffmpeg transcode profiles in LPMS. This is not perfect, but it is better than assigning one bitrate for all resolutions. We can improve this in a future PR.
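
The following is a rough, non-authoritative sketch of that outpixels calculation; the GetCodecInfoBytes return shape and the MediaFormatInfo field names are assumed here and should be checked against the LPMS source:

```go
package server

import (
	"os"

	"github.com/livepeer/lpms/ffmpeg"
)

// estimateOutPixels derives width*height*frames from the container headers so
// the pixel count can be computed without shelling out to ffprobe.
// NOTE: the (status, format, err) return shape and the Width/Height field
// names are assumptions for illustration.
func estimateOutPixels(videoPath string, frames int64) (int64, error) {
	data, err := os.ReadFile(videoPath)
	if err != nil {
		return 0, err
	}
	_, format, err := ffmpeg.GetCodecInfoBytes(data)
	if err != nil {
		return 0, err
	}
	return int64(format.Width) * int64(format.Height) * frames, nil
}
```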

Please update to add:

  1. Return the detection data to the user; right now it is dropped and only the video is returned.
  2. Address some additional suggestions in the ai-worker PR about things to add to the returned data.

Questions:

  1. Similar APIs seem to charge by the minute, which I think is conceptually easier than pricing based on pixels. Do you think pricing by seconds makes sense for this pipeline? Do you know if inference time changes significantly based on input size? Note that we already pull the input duration from ffmpeg for the audio-to-text pipeline, so it should be a pretty low lift to change to pricing by the second (or by millisecond, as the audio-to-text pipeline uses). cc @rickstaa if you have thoughts on pricing for this pipeline.
  2. Do you know if there is a good way to render the boxes client-side? In my research, the HTML5 video tag does not behave the same across browsers when a new frame is displayed or when signalling time updates (Firefox looks like every frame, Chrome/Safari every 200-250 ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working, and most users would just want the detection data. That said, I think it is good to be able to return both, if wanted, with the batch processing API.

If you want to try it out, my Docker builds are adastravideo/go-livepeer:object-detection and adastravideo/ai-runner:object-detection. Attaching the result of one detection run for reference:

e2b0f489.mp4

@@ -257,3 +259,5 @@ require (
lukechampine.com/blake3 v1.2.1 // indirect
rsc.io/tmplfunc v0.0.3 // indirect
)

Collaborator

Note to remove before merging.

Author

Just give me a heads up before we merge and I will get it removed in the last commit.

@@ -48,6 +48,31 @@ const (
Complete ImageToVideoStatus = "complete"
)

type ObjectDetectionResponseAsync struct {
Collaborator

Are these being used?

Author

Yes, the struct type is being used here

Collaborator

I think we should not add another async result endpoint. I believe we will add a universal endpoint to check for async results at some point down the road rather than have async for some pipelines and not others. @rickstaa what are your thoughts?

Author

I agree that generalising the async result endpoints across all pipelines is the better approach. But should I keep the async functionality in this PR for now? @ad-astra-video, what would you suggest?
cc @rickstaa

Collaborator

I would prefer to remove the async part in this PR and put a more general way to make all requests async (if the user prefers) on the roadmap.

Author

Done in the latest commit 👍

@RUFFY-369
Author

@RUFFY-369 I have looked through the code and ran it E2E (with the updates below) in Docker. You did a good job getting all the remote worker parts together!

Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:

  1. I updated ai_http.go to use ffmpeg.GetCodecInfoBytes for the outpixels calculation. Also updated ai-worker to use the LPMS function because ffprobe is not installed in the Docker container.
  2. Updated core/ai_worker.go to transcode all the frames into one MP4; I think it was transcoding each frame individually into separate MP4s before the update.
    2a. I also updated it to guess an appropriate bitrate using the ffmpeg transcode profiles in LPMS. This is not perfect, but it is better than assigning one bitrate for all resolutions. We can improve this in a future PR.

Thanks for the PR; getting all frames into one MP4 was also a TODO for me. I have already merged those changes 👍

Please update to add:

  1. Return the detection data to the user; right now it is dropped and only the video is returned.
    Done in the recent commits. Please do an E2E run on your side for cross-validation 🚀
  2. Address some additional suggestions in the ai-worker PR about things to add to the returned data.

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

Questions:

  1. Similar APIs seem to charge by the minute, which I think is conceptually easier than pricing based on pixels. Do you think pricing by seconds makes sense for this pipeline? Do you know if inference time changes significantly based on input size? Note that we already pull the input duration from ffmpeg for the audio-to-text pipeline, so it should be a pretty low lift to change to pricing by the second (or by millisecond, as the audio-to-text pipeline uses). cc @rickstaa if you have thoughts on pricing for this pipeline.

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.
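
For concreteness, a tiny per-input-pixel pricing example; the resolution, duration, and price below are made up purely for illustration:

```go
package server

// exampleFee shows the arithmetic behind per-input-pixel pricing for a
// 10 s, 30 fps, 1280x720 input at a hypothetical 1e-12 USD per pixel.
func exampleFee() float64 {
	const (
		width, height = 1280, 720
		fps, seconds  = 30, 10
		usdPerPixel   = 1e-12 // hypothetical price, not a real network rate
	)
	inputPixels := float64(width * height * fps * seconds) // 276,480,000 input pixels
	return inputPixels * usdPerPixel                       // roughly 0.00028 USD
}
```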

  1. Do you know if there is a good way to render the boxes client-side? In my research, the HTML5 video tag does not behave the same across browsers when a new frame is displayed or when signalling time updates (Firefox looks like every frame, Chrome/Safari every 200-250 ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working, and most users would just want the detection data. That said, I think it is good to be able to return both, if wanted, with the batch processing API.

Hmm, regarding rendering boxes client-side, I'll need to explore this further as it's an area I haven't worked on deeply yet. Having the choice available is better, since the frames can also be annotated with the detection output outside the pipeline loop.

@RUFFY-369
Author

Also, @ad-astra-video, could you cross-check on your side with an E2E run against the recent commits?
Otherwise, I have addressed all the requested changes. 👍
Thanks

@ad-astra-video
Collaborator

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

The object-detection route in go-livepeer currently returns only the video from the Orchestrator. The ai-runner returns all the information, but go-livepeer drops the detection data in the parseMultiPartResult function, where it is converted to an ImageResponse. I think we should return the ObjectDetectionResponse, with the video being optional for the user since it currently uses slower CPU encoding.
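
As a rough illustration of what is being asked for here (not the actual parseMultiPartResult code), a multipart parser that keeps both the detection JSON and the optional video part instead of dropping the detections; the part names and the result struct are assumptions:

```go
package server

import (
	"encoding/json"
	"io"
	"mime/multipart"
)

// ObjectDetectionResult is a hypothetical combined result; the field and
// multipart part names are assumptions for illustration.
type ObjectDetectionResult struct {
	Detections json.RawMessage // detection data from the runner, passed through as-is
	Video      []byte          // optional re-encoded video
}

// parseDetectionParts keeps every part of the multipart response instead of
// keeping only the video, which is the behaviour requested in this comment.
func parseDetectionParts(body io.Reader, boundary string) (*ObjectDetectionResult, error) {
	mr := multipart.NewReader(body, boundary)
	res := &ObjectDetectionResult{}
	for {
		part, err := mr.NextPart()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return nil, err
		}
		switch part.FormName() {
		case "detections": // JSON detection data
			res.Detections = json.RawMessage(data)
		case "video": // annotated MP4, optional for the caller
			res.Video = data
		}
	}
	return res, nil
}
```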

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.

I think pricing based on pixels most accurately reflects compute difficulty, since it incentivizes users to send in lower-resolution samples to process (e.g. 720p or lower) to get a better price. That said, other services price inference based on video seconds, so that would be easiest for users converting to the Livepeer network. I am fine with leaving pricing per pixel for now to stay similar to the other pipelines. Audio uses pricing based on input file duration only because there are no pixels to count.

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

I was not clear about what I was asking, sorry about that. I was curious about the inference time difference between, say, 1080p and 360p. Below are examples of the inference time at the two resolutions using the same input video. Inference is a little less than 10% faster at 360p, but decoding is about 800% faster, so in my opinion it should cost less to process.

1080p

2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds

360p note: annotating the frames adds about 1 second to detections time in this 10 second video

2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
2024-11-27 15:56:46,855 INFO:     172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK

@RUFFY-369
Author

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

The object-detection route in go-livepeer currently returns only the video from the Orchestrator. The ai-runner returns all the information, but go-livepeer drops the detection data in the parseMultiPartResult function, where it is converted to an ImageResponse. I think we should return the ObjectDetectionResponse, with the video being optional for the user since it currently uses slower CPU encoding.

I changed the result from ImageResponse to ObjectDetectionResponse in the earlier commits, after you pointed it out.

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.

I think pricing based on pixels most accurately reflects compute difficulty, since it incentivizes users to send in lower-resolution samples to process (e.g. 720p or lower) to get a better price. That said, other services price inference based on video seconds, so that would be easiest for users converting to the Livepeer network. I am fine with leaving pricing per pixel for now to stay similar to the other pipelines. Audio uses pricing based on input file duration only because there are no pixels to count.

I think pricing should eventually be revisited for all the pipelines, using a simple, not overly complex, combination of different metrics. But for now I am leaving pricing based on pixels, similar to the other pipelines, as you mentioned. 👍

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

I was not clear about what I was asking, sorry about that. I was curious about the inference time difference between, say, 1080p and 360p. Below are examples of the inference time at the two resolutions using the same input video. Inference is a little less than 10% faster at 360p, but decoding is about 800% faster, so in my opinion it should cost less to process.

1080p

2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds

360p note: annotating the frames adds about 1 second to detections time in this 10 second video

2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
2024-11-27 15:56:46,855 INFO:     172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK

Thank you for the clarification!
I suspected you might also be asking about resolution. As I mentioned in the previous reply, I noticed (across different videos rather than the same one) that resolution plays a bigger role in total inference time than the overall size (frame resolution x duration) of the input video. So for a quick result, users should lower the video resolution rather than shorten the video.

The data you provided gives quite nice insights.
As previously discussed, annotation can be added as optional functionality in the API.

@RUFFY-369
Author

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@ad-astra-video
Collaborator

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.

This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

@RUFFY-369
Author

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.

This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

Hi @ad-astra-video, thanks for the PR, I will take a look and get it merged.
You mean the updates you requested in the ai-worker PR comment, right? I will get them done right away.
Let me finish both so you can review the changes, and then let's get this pipeline merged 🙏

@RUFFY-369
Author

RUFFY-369 commented Dec 1, 2024

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.
This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

Hi @ad-astra-video, thanks for the PR, I will take a look and get it merged. You mean the updates you requested in the ai-worker PR comment, right? I will get them done right away. Let me finish both so you can review the changes, and then let's get this pipeline merged 🙏

@ad-astra-video I have made the requested changes in the ai-worker repo and made the corresponding changes in this PR to support them. You can have a look 👍 🚀

@ad-astra-video
Collaborator

ad-astra-video commented Jan 2, 2025

@RUFFY-369 I put up a PR on your repo with some changes I used to test end to end. They are mostly relatively small, plus some changes to incorporate the PR I put up on your ai-worker repo.

Can you rebase this onto master? Then we can merge!

@RUFFY-369
Author

@RUFFY-369 I put up a PR on your repo with some changes I used to test end to end. They are mostly relatively small, plus some changes to incorporate the PR I put up on your ai-worker repo.

Can you rebase this onto master? Then we can merge!

@ad-astra-video I have merged your PR, thanks!
I have also rebased this onto master 👍
