
[AI] Add support for Object Detection pipeline #3228

Open

RUFFY-369 wants to merge 31 commits into livepeer:master
Conversation


@RUFFY-369 RUFFY-369 commented Nov 1, 2024

What does this pull request do? Explain your changes. (required)

Adds support for the ai-worker object-detection pipeline, which runs the (real-time) RT-DETR model by default.

Corresponding AI-Worker PR: livepeer/ai-worker#243
Specific updates (required)

How did you test each of these updates? (required)

Testing was performed by running the gateway + worker locally.

Does this pull request close any open issues?

Checklist:

cc @rickstaa

@RUFFY-369 RUFFY-369 requested a review from rickstaa as a code owner November 1, 2024 15:19
@github-actions bot added the AI label (Issues and PRs related to the AI-video branch) Nov 1, 2024
@leszko leszko deleted the branch livepeer:master November 7, 2024 08:26
@leszko leszko closed this Nov 7, 2024
@rickstaa rickstaa reopened this Nov 13, 2024
@rickstaa rickstaa changed the base branch from ai-video to master November 13, 2024 21:55
@ad-astra-video
Collaborator

ad-astra-video commented Nov 18, 2024

Did an initial look through and it looks pretty good. Will update tomorrow when I can pull and run it locally. Thank you for adding tests!

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

@leszko
Contributor

leszko commented Nov 18, 2024

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

I don't have enough context to answer this. @rickstaa may know more.

@RUFFY-369
Author

RUFFY-369 commented Nov 18, 2024

One question: would it make sense to have an option to return only the detections text response? The frames are transcoded on the CPU, so it will be pretty slow. @rickstaa or @leszko, is NVIDIA transcoding still a bit in the future?

If @rickstaa comments on the future of CPU transcoding, then I can push a commit adding a labels_only header to the API to avoid calling transcodeFrames.
P.S. If agreed after discussion, the header can be added after E2E testing of the pipeline in its current state.
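
For illustration, a minimal sketch of how such a header could gate the transcode step; the labels_only header name, the handler shape, and the response types here are assumptions for this sketch, not the merged implementation:

```go
package server

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Hypothetical types for illustration only; the real PR's types may differ.
type Detection struct {
	Label      string  `json:"label"`
	Confidence float64 `json:"confidence"`
}

type ObjectDetectionResponse struct {
	Detections []Detection `json:"detections"`
	Video      []byte      `json:"video,omitempty"`
}

// transcodeFrames stands in for the slow CPU transcode step.
func transcodeFrames(frames [][]byte) ([]byte, error) { return nil, nil }

// handleObjectDetection skips the transcode entirely when the caller only
// wants the detections (signalled via the hypothetical "labels_only" header).
func handleObjectDetection(w http.ResponseWriter, r *http.Request, frames [][]byte, dets []Detection) error {
	labelsOnly := strings.EqualFold(r.Header.Get("labels_only"), "true")

	resp := ObjectDetectionResponse{Detections: dets}
	if !labelsOnly {
		video, err := transcodeFrames(frames) // slow CPU path, only when the video is requested
		if err != nil {
			return err
		}
		resp.Video = video
	}

	w.Header().Set("Content-Type", "application/json")
	return json.NewEncoder(w).Encode(resp)
}
```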

@rickstaa
Member

@leszko, @ad-astra-video I haven't planned to replace the CPU transcoding yet; I was considering holding off until it becomes necessary for the real-time version of this pipeline. I think @ad-astra-video's solution makes sense for now.

@ad-astra-video
Collaborator

@RUFFY-369 I have looked through the code and ran it E2E (with the updates below) in Docker. You did a good job getting all the remote worker parts together!

Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:

  1. I updated ai_http.go to use ffmpeg.GetCodecInfoBytes for the outpixels calculation (a rough sketch follows this list). Also updated ai-worker to use the LPMS function because ffprobe is not installed in the Docker container.
  2. Updated core/ai_worker.go to transcode all the frames into one MP4; I think it was transcoding each frame individually into separate MP4s before the update.
    2a. I also updated it to guess an appropriate bitrate using the ffmpeg transcode profiles in LPMS. This is not perfect, but it is better than assigning one bitrate for all resolutions. We can improve this in a future PR.
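
The following is a rough, non-authoritative sketch of that outpixels calculation; the GetCodecInfoBytes return shape and the MediaFormatInfo field names are assumed here and should be checked against the LPMS source:

```go
package server

import (
	"os"

	"github.com/livepeer/lpms/ffmpeg"
)

// estimateOutPixels derives width*height*frames from the container headers so
// the pixel count can be computed without shelling out to ffprobe.
// NOTE: the (status, format, err) return shape and the Width/Height field
// names are assumptions for illustration.
func estimateOutPixels(videoPath string, frames int64) (int64, error) {
	data, err := os.ReadFile(videoPath)
	if err != nil {
		return 0, err
	}
	_, format, err := ffmpeg.GetCodecInfoBytes(data)
	if err != nil {
		return 0, err
	}
	return int64(format.Width) * int64(format.Height) * frames, nil
}
```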

Please update to add:

  1. Return the detection data to the user; right now it is dropped and only the video is returned.
  2. Address some additional suggestions in the ai-worker PR about things to add to the returned data.

Questions:

  1. Similar APIs seem to charge by the minute, which I think is conceptually easier than pricing based on pixels. Do you think pricing by seconds makes sense for this pipeline? Do you know if inference time changes significantly based on input size? Note that we already pull the input duration from ffmpeg for the audio-to-text pipeline, so it should be a pretty low lift to change to pricing by the second (or by millisecond, as the audio-to-text pipeline uses). cc @rickstaa if you have thoughts on pricing for this pipeline.
  2. Do you know if there is a good way to render the boxes client-side? In my research, the HTML5 video tag does not behave the same across browsers when a new frame is displayed or when signalling time updates (Firefox looks like every frame, Chrome/Safari every 200-250 ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working, and most users would just want the detection data. That said, I think it is good to be able to return both, if wanted, with the batch processing API.

If you want to try it out, my Docker builds are adastravideo/go-livepeer:object-detection and adastravideo/ai-runner:object-detection. Attaching the result of one detection run for reference:

e2b0f489.mp4

@@ -257,3 +259,5 @@ require (
lukechampine.com/blake3 v1.2.1 // indirect
rsc.io/tmplfunc v0.0.3 // indirect
)

Collaborator

Note to remove before merging.

Author

Just give me a heads up before we merge and I will get it removed in the last commit.

@@ -48,6 +48,31 @@ const (
Complete ImageToVideoStatus = "complete"
)

type ObjectDetectionResponseAsync struct {
Collaborator

Are these being used?

Author

Yes, the struct type is being used here

Collaborator

I think we should not add another async result endpoint. I believe we will add a universal endpoint to check for async results at some point down the road rather than have async for some pipelines and not others. @rickstaa what are your thoughts?

Author

I agree that generalising the async result endpoints across all pipelines is the better approach. But should I keep the async functionality in this PR for now? @ad-astra-video, what would you suggest?
cc @rickstaa

Collaborator

I would prefer to remove the async part in this PR and put a more general way to make all requests async (if the user prefers) on the roadmap.

Author

Done in the latest commit 👍

@RUFFY-369
Author

@RUFFY-369 I have looked through the code and ran it E2E (with the updates below) in Docker. You did a good job getting all the remote worker parts together!

Some updates I sent in a PR to your branch, feel free to merge or use as a guide to adjust your branch:

  1. I updated ai_http.go to use ffmpeg.GetCodecInfoBytes for the outpixels calculation. Also updated ai-worker to use the LPMS function because ffprobe is not installed in the Docker container.
  2. Updated core/ai_worker.go to transcode all the frames into one MP4; I think it was transcoding each frame individually into separate MP4s before the update.
    2a. I also updated it to guess an appropriate bitrate using the ffmpeg transcode profiles in LPMS. This is not perfect, but it is better than assigning one bitrate for all resolutions. We can improve this in a future PR.

Thanks for the PR; getting all frames into one MP4 was also a TODO for me. I have already merged those changes 👍

Please update to add:

  1. Return the detection data to the user; right now it is dropped and only the video is returned.
    Done in the recent commits. Please do an E2E run on your side for cross-validation 🚀
  2. Address some additional suggestions in the ai-worker PR about things to add to the returned data.

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

Questions:

  1. Similar APIs seem to charge by the minute, which I think is conceptually easier than pricing based on pixels. Do you think pricing by seconds makes sense for this pipeline? Do you know if inference time changes significantly based on input size? Note that we already pull the input duration from ffmpeg for the audio-to-text pipeline, so it should be a pretty low lift to change to pricing by the second (or by millisecond, as the audio-to-text pipeline uses). cc @rickstaa if you have thoughts on pricing for this pipeline.

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.
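
For concreteness, a tiny per-input-pixel pricing example; the resolution, duration, and price below are made up purely for illustration:

```go
package server

// exampleFee shows the arithmetic behind per-input-pixel pricing for a
// 10 s, 30 fps, 1280x720 input at a hypothetical 1e-12 USD per pixel.
func exampleFee() float64 {
	const (
		width, height = 1280, 720
		fps, seconds  = 30, 10
		usdPerPixel   = 1e-12 // hypothetical price, not a real network rate
	)
	inputPixels := float64(width * height * fps * seconds) // 276,480,000 input pixels
	return inputPixels * usdPerPixel                       // roughly 0.00028 USD
}
```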

  1. Do you know if there is a good way to render the boxes client-side? In my research, the HTML5 video tag does not behave the same across browsers when a new frame is displayed or when signalling time updates (Firefox looks like every frame, Chrome/Safari every 200-250 ms). Maybe seeing the detected items is mostly for debugging/confirming the model is working, and most users would just want the detection data. That said, I think it is good to be able to return both, if wanted, with the batch processing API.

Hmm, regarding rendering boxes client-side, I'll need to explore this further as it's an area I haven't worked on deeply yet. Having the choice available is better, since the frames can also be annotated with the detection output outside the pipeline loop.

@RUFFY-369
Author

Also, @ad-astra-video, could you cross-check on your side with an E2E run against the recent commits?
Otherwise, I have addressed all the requested changes. 👍
Thanks

@ad-astra-video
Collaborator

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

The object-detection route in go-livepeer currently returns only the video from the Orchestrator. The ai-runner returns all the information, but go-livepeer drops the detection data in the parseMultiPartResult function, where it is converted to an ImageResponse. I think we should return the ObjectDetectionResponse, with the video being optional for the user since it currently uses slower CPU encoding.
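
As a rough illustration of what is being asked for here (not the actual parseMultiPartResult code), a multipart parser that keeps both the detection JSON and the optional video part instead of dropping the detections; the part names and the result struct are assumptions:

```go
package server

import (
	"encoding/json"
	"io"
	"mime/multipart"
)

// ObjectDetectionResult is a hypothetical combined result; the field and
// multipart part names are assumptions for illustration.
type ObjectDetectionResult struct {
	Detections json.RawMessage // detection data from the runner, passed through as-is
	Video      []byte          // optional re-encoded video
}

// parseDetectionParts keeps every part of the multipart response instead of
// keeping only the video, which is the behaviour requested in this comment.
func parseDetectionParts(body io.Reader, boundary string) (*ObjectDetectionResult, error) {
	mr := multipart.NewReader(body, boundary)
	res := &ObjectDetectionResult{}
	for {
		part, err := mr.NextPart()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(part)
		if err != nil {
			return nil, err
		}
		switch part.FormName() {
		case "detections": // JSON detection data
			res.Detections = json.RawMessage(data)
		case "video": // annotated MP4, optional for the caller
			res.Video = data
		}
	}
	return res, nil
}
```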

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.

I think pricing based on pixels most accurately reflects compute difficulty, since it incentivizes users to send in lower-resolution samples to process (e.g. 720p or lower) to get a better price. That said, other services price inference based on video seconds, so that would be easiest for users converting to the Livepeer network. I am fine with leaving pricing per pixel for now to stay similar to the other pipelines. Audio uses pricing based on input file duration only because there are no pixels to count.

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

I was not clear about what I was asking, sorry about that. I was curious about the inference time difference between, say, 1080p and 360p. Below are examples of the inference time at the two resolutions using the same input video. Inference is a little less than 10% faster at 360p, but decoding is about 800% faster, so in my opinion it should cost less to process.

1080p

2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds

360p note: annotating the frames adds about 1 second to detections time in this 10 second video

2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
2024-11-27 15:56:46,855 INFO:     172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK

@RUFFY-369
Author

Can you elaborate a little? Are there any changes still to be made in the ai-worker repo? I pushed the required changes there when the go-livepeer changes were introduced.

The object-detection route in go-livepeer currently returns only the video from the Orchestrator. The ai-runner returns all the information, but go-livepeer drops the detection data in the parseMultiPartResult function, where it is converted to an ImageResponse. I think we should return the ObjectDetectionResponse, with the video being optional for the user since it currently uses slower CPU encoding.

I changed the result from ImageResponse to ObjectDetectionResponse in the earlier commits, after you pointed it out.

Could the initial pricing be done similar to SAM2: x USD per input pixel (height * width * frames)? Thinking about pricing, we can mainly price either by compute seconds or by model output. For this pipeline I think compute seconds would be the appropriate metric; pricing by model output suits generative models, where output resolution can vary from high to low quality.

I think pricing based on pixels most accurately reflects compute difficulty, since it incentivizes users to send in lower-resolution samples to process (e.g. 720p or lower) to get a better price. That said, other services price inference based on video seconds, so that would be easiest for users converting to the Livepeer network. I am fine with leaving pricing per pixel for now to stay similar to the other pipelines. Audio uses pricing based on input file duration only because there are no pixels to count.

I think pricing should eventually be revisited for all the pipelines, using a simple, not overly complex, combination of different metrics. But for now I am leaving pricing based on pixels, similar to the other pipelines, as you mentioned. 👍

Regarding the change in inference time with respect to input size, I did some inference runs for files of various sizes on a T4 GPU in Google Colab. There were five files of 12.9 MB, 21.3 MB, 39.8 MB, 97.8 MB and 116.1 MB. One thing to note: even if one input video file is a bit larger than another because it has more frames, the smaller file can still take longer to infer if its resolution is higher.
Other than that, the inference time for these files scaled roughly as t -> 1.54t -> 2.73t -> 9.06t -> 6.37t (in multiples of t) respectively.

I was not clear about what I was asking, sorry about that. I was curious about the inference time difference between, say, 1080p and 360p. Below are examples of the inference time at the two resolutions using the same input video. Inference is a little less than 10% faster at 360p, but decoding is about 800% faster, so in my opinion it should cost less to process.

1080p

2024-11-27 16:02:27,543 - app.routes.object_detection - INFO - Decoding video: video size: 3779273
2024-11-27 16:02:30,512 - app.routes.object_detection - INFO - Decoded video in 2.95 seconds
2024-11-27 16:02:43,502 - app.routes.object_detection - INFO - Detections processed in 12.99 seconds

360p note: annotating the frames adds about 1 second to detections time in this 10 second video

2024-11-27 15:56:24,062 - app.routes.object_detection - INFO - Decoded video in 0.37 seconds
2024-11-27 15:56:35,177 - app.routes.object_detection - INFO - Detections processed in 11.12 seconds
2024-11-27 15:56:46,657 - app.routes.object_detection - INFO - Annotated frames converted to data URLs in 11.48 seconds, frame count: 266
2024-11-27 15:56:46,855 INFO:     172.17.0.1:58998 - "POST /object-detection HTTP/1.1" 200 OK

Thank you for the clarification!
I suspected you might also be asking about resolution. As I mentioned in the previous reply, I noticed (across different videos rather than the same one) that resolution plays a bigger role in total inference time than the overall size (frame resolution x duration) of the input video. So for a quick result, users should lower the video resolution rather than shorten the video.

The data you provided gives quite nice insights.
As previously discussed, annotation can be added as optional functionality in the API.

@RUFFY-369
Author

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@ad-astra-video
Collaborator

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.

This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

@RUFFY-369
Author

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.

This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

Hi @ad-astra-video, thanks for the PR, I will take a look and get it merged.
You mean the updates you requested in the ai-worker PR comment, right? I will get them done right away.
Let me finish both so you can review the changes, and then let's get this pipeline merged 🙏

@RUFFY-369
Author

RUFFY-369 commented Dec 1, 2024

@ad-astra-video What final changes do I need to make to get this PR ready for merge? I think most of them are addressed.

@RUFFY-369 I put a PR up to remove the remaining async object detection route code and update go.mod and go.sum.
This PR is in good shape, but the ai-worker PR needs to be completed before merging this one. Some changes will be needed here from the updates requested in the ai-worker PR for sending the re-encoded video back from the runner, but I expect them to be relatively minor.

Hi @ad-astra-video, thanks for the PR, I will take a look and get it merged. You mean the updates you requested in the ai-worker PR comment, right? I will get them done right away. Let me finish both so you can review the changes, and then let's get this pipeline merged 🙏

@ad-astra-video I have made the requested changes in the ai-worker repo and made the corresponding changes in this PR to support them. You can have a look 👍 🚀

@ad-astra-video
Collaborator

ad-astra-video commented Jan 2, 2025

@RUFFY-369 I put up a PR on your repo with some changes I used to test end to end. They are mostly relatively small, plus some changes to incorporate the PR I put up on your ai-worker repo.

Can you rebase this onto master? Then we can merge!

@RUFFY-369
Author

@RUFFY-369 I put up a PR on your repo with some changes I used to test end to end. They are mostly relatively small, plus some changes to incorporate the PR I put up on your ai-worker repo.

Can you rebase this onto master? Then we can merge!

@ad-astra-video I have merged your PR, thanks!
I have also rebased this onto master 👍
