
High CPU usage of the resource destructor #178

Open
varsill opened this issue Dec 9, 2024 · 9 comments
Comments

@varsill

varsill commented Dec 9, 2024

Hello!
Recently I've come across a problem with high CPU usage when processing images with Vix v0.31.1 in Elixir 1.16.3 on OTP 26.
My scenario is as follows: I am generating a number of videos out of sequences of JPEG images in real time. That is why I need to load images from files and convert them to the YUV color space (so that they can later be encoded).
To be more precise, for each video I run the following sequence of operations multiple times per second (corresponding to the video's framerate):

Vix.Image.new_from_binary
Vix.Image.write_to_binary
Vix.Operation.jpegload_buffer
Vix.Operation.linear!
Vix.Operation.bandjoin!
Vix.Operation.xyz
Vix.Operation.composite2
Image.YUV.write_to_binary
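
Roughly, the core of each per-frame run looks like the sketch below (simplified and partial; `background`, `alpha`, the constants, and the `:C420` format are placeholders, not my exact code):

```elixir
# Simplified per-frame sketch (placeholder inputs, not my exact code):
# `background` is the composition base, `alpha` the mask source image
frame = Vix.Vips.Operation.jpegload_buffer!(jpeg_binary)
mask = Vix.Vips.Operation.linear!(alpha, [1.0], [0.0])         # build the per-frame mask
rgba = Vix.Vips.Operation.bandjoin!([frame, mask])             # attach the alpha band
out = Vix.Vips.Operation.composite2!(background, rgba, :VIPS_BLEND_MODE_OVER)
{:ok, yuv} = Image.YUV.write_to_binary(out, :C420)             # convert to YUV
```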

Though this task is indeed quite computationally heavy, it looks to me as if it were using more resources than I would expect.
Apart from that, I am unable to scale the number of videos beyond 5 or 6, despite there still being CPU resources left on my computer.
I did some profiling and noticed that normal scheduler 1 is pushed to its limits, with its utilization reaching 100%, in contrast to the other normal schedulers, which are almost unused. Below is a screenshot from the Erlang observer:
[screenshot: Erlang observer showing normal scheduler 1 at 100% utilization while the other normal schedulers are almost idle]

I thought it might have something to do with the fact that resource destructors triggered from dirty NIF schedulers are always run on normal scheduler 1 (https://github.com/erlang/otp/blob/51548614b588c667a4230c17e2b187474dd70d63/erts/emulator/beam/erl_nif.c#L2841),
so I decided to see what operations are executed on that scheduler. I used sample to profile the beam.smp process, and it turns out that the majority of the normal scheduler 1 thread's time is indeed spent in run_resource_dtor.
Below is part of the output of my profiling results:

   36632 Thread_2779870: erts_sched_1
    + 25350 ???  (in <unknown binary>)  [0x103f21a64]
    + ! 25346 ???  (in <unknown binary>)  [0x103e6dee0]
    + ! : 21982 erts_schedule  (in beam.smp) + 8660  [0x102adaf14]
    + ! : | 21953 handle_aux_work  (in beam.smp) + 3652  [0x102adff50]
    + ! : | + 21916 run_resource_dtor  (in beam.smp) + 256  [0x102cb3a04]
    + ! : | + ! 21245 g_object_unref  (in libvips.42.dylib) + 224  [0x35c327b88]
    + ! : | + ! : 17932 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x146d40  [0x35c1ced40]
    + ! : | + ! : | 17930 g_signal_emit  (in libvips.42.dylib) + 28  [0x35c334b60]
    + ! : | + ! : | + 17778 g_signal_emit_valist  (in libvips.42.dylib) + 64  [0x35c33418c]
    + ! : | + ! : | + ! 17613 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x2ac708  [0x35c334708]
    + ! : | + ! : | + ! : 17216 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x2ad35c  [0x35c33535c]
    + ! : | + ! : | + ! : | 17210 g_closure_invoke  (in libvips.42.dylib) + 228  [0x35c323fa8]
    + ! : | + ! : | + ! : | + 17145 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x143b40  [0x35c1cbb40]
    + ! : | + ! : | + ! : | + ! 17143 g_object_set  (in libvips.42.dylib) + 28  [0x35c329bac]
    + ! : | + ! : | + ! : | + ! : 16589 g_object_set_valist  (in libvips.42.dylib) + 700  [0x35c3295c0]
    + ! : | + ! : | + ! : | + ! : | 16413 g_object_unref  (in libvips.42.dylib) + 224  [0x35c327b88]
    + ! : | + ! : | + ! : | + ! : | + 11518 ???  (in libvips.42.dylib)  load address 0x35c088000 + 0x146d40  [0x35c1ced40]
    + ! : | + ! : | + ! : | + ! : | + ! 11517 g_signal_emit  (in libvips.42.dylib) + 28  [0x35c334b60]
    + ! : | + ! : | + ! : | + ! : | + ! : 11386 g_signal_emit_valist  (in libvips.42.dylib) + 64  [0x35c33418c]
   ...

Interestingly, I see that the part starting with g_signal_emit repeats itself several thousand times, apparently decrementing some reference counter.
I've tried reducing the Vips concurrency, and after setting it to 1, the amount of CPU used by run_resource_dtor is smaller, but still significant (around 50% of the time measured with Vips concurrency set to 8).

I have several questions concerning my issue:

  1. Why does resource deallocation in Vips take that much CPU?
  2. Is it expected that the "g_signal_emit cycle" runs that many times?
  3. Are there some Vips flags that could be used to do the deallocation "in bulk"?
@akash-akya
Owner

@varsill thanks for the detailed issue report and analysis.

Deallocation taking a lot of CPU is certainly odd. I'll check this as high priority as soon as I get some time.

Are there some Vips flags that might be used to do the deallocation "in a bulk"?

As of now no.

Is it possible to share some sample code so I can reproduce this easily? I would also like to see the context in which these operations are performed (process involvement, etc.).

Also, a blind suggestion if you want an immediate workaround: try wrapping all the vips operations for each frame in a separate process and see if it helps. Something like Task.async(fn -> ... end) |> Task.await()
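
A minimal sketch of that workaround (`process_one_frame/1` is a placeholder for your whole per-frame vips pipeline):

```elixir
# Run the whole per-frame pipeline in a short-lived process, so the
# resource destructors are triggered from (and spread across) different
# schedulers instead of piling up on one.
def convert_frame(jpeg_binary) do
  Task.async(fn -> process_one_frame(jpeg_binary) end)
  |> Task.await()
end
```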

@varsill
Author

varsill commented Dec 10, 2024

@akash-akya Thank you very much for a quick response!

Deallocation taking lot of CPU is certainly odd. I'll check this as high priority as soon as I get sometime.

Great, thanks in advance!

Also, a blind suggestion if you want an immediate workaround, try wrapping all the vips operations for each frame in separate process and see if it helps? Something like Task.async(fn -> ... end) |> Task.await()

Indeed, this one seems to solve the issue with uneven work distribution among the normal schedulers. When I run :erlang.statistics(:microstate_accounting), I see that the normal schedulers are now sharing the auxiliary work. It allows me to scale the number of concurrent streams even further, thanks for that workaround suggestion! Unfortunately, the overall CPU consumption seems to be slightly higher than without that change, and I still see that much of the work is spent in run_resource_dtor.
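
For reference, microstate accounting has to be switched on before reading the counters; I collected the numbers roughly like this:

```elixir
# Enable microstate accounting, run the workload, then read the counters
:erlang.system_flag(:microstate_accounting, true)
# ... run the video pipeline for a while ...
msacc = :erlang.statistics(:microstate_accounting)
# each entry is a map with :type, :id and per-state :counters (aux, gc, nifs, ...)
```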

Is it possible to share a sample code so I can reproduce this easily?

Sure, I will try to prepare a sample script that reproduces that behaviour.

@akash-akya
Owner

Hi @varsill, any luck with the sample script? :)

@varsill
Author

varsill commented Dec 13, 2024

Hello @akash-akya!
Sorry for the late reply; unfortunately, I am having trouble providing a script with which you could reproduce the issue. Here is a script that reflects the operations I do in my system: https://gist.github.com/varsill/ea729390b44ffb256a0dbd91bfb146e1

To run it, you first need to:

  • create an inputs/ directory at the same level where you put the script
  • put a bunch of JPEG images into the ./inputs directory - for instance, you can generate them out of some input .mp4 file with the following command: ffmpeg -i input.mp4 -vf "fps=30" output_%06d.jpeg

However, when I run this script, I am unable to observe the behaviour I encountered in my system.
I think the occurrence of the problem has something to do with either the workload generated by the rest of my system or the lifetime of the binaries in the system.

I kept on debugging my system and I was able to find out that:

  • When I don't calculate the mask function for each frame, but instead prepare the mask in advance (which in my case is slightly problematic, as my input images are of different resolutions), the problem occurs "occasionally".
  • By "occasionally" I mean that sometimes I don't observe high CPU usage (and then for how_many_videos = 5 I observe 5 times the CPU usage I measured for how_many_videos = 1), while at other times I observe excessive CPU usage by the NIF destructors.
  • When I observe excessive CPU usage, the results of microstate_accounting also show that most of the first scheduler's time is spent on aux jobs.
  • I've observed that when the excessive CPU usage occurs, the number of libvips worker threads is abnormally high (reaching 80 at times). It looks as if vips is dynamically spawning threads even though I have set a fixed concurrency_set limit.
  • I've figured out that I can prevent vips from spawning threads dynamically with the VIPS_MAX_THREADS env variable, and when I set it to the number of cores on my computer, the problem with excessive CPU usage goes away. Unfortunately, I am then unable to scale the number of videos further, as I start to see Failed to write extracted region to memory errors.
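
For reference, this is how I cap the pool (the value and the script name are just placeholders; I set the value to the number of cores on my machine):

```shell
# Cap libvips' worker-thread pool before starting the VM
export VIPS_MAX_THREADS=12   # placeholder: number of cores on the host
elixir my_pipeline.exs       # placeholder script name
```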

@varsill
Author

varsill commented Dec 17, 2024

Hi @akash-akya !
I've also opened a thread on the Erlang forum: https://erlangforums.com/t/how-to-deal-with-destructors-that-can-take-a-while-to-run-and-possibly-block-the-scheduler/ where I asked how potentially long-running resource destructors should be handled. I am slightly worried that doing all these g_object_unref calls in a destructor might cause some trouble (I've already seen it waiting on some OS mutexes, which I believe might preempt the whole scheduler's thread).
Please tell me what you think about the approach suggested by jhogberg in one of the answers there, i.e. doing the resource cleanup in a dedicated process running a NIF on a dirty scheduler "on behalf of" the destructor.
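
To make the idea concrete, here is a rough sketch of what I imagine on the Elixir side (everything here is hypothetical: unref_dirty_nif/1 would be a new NIF scheduled on a dirty CPU scheduler, and the C destructor would only send a message to this process instead of calling g_object_unref itself):

```elixir
defmodule Vix.ResourceJanitor do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_info({:unref, resource}, state) do
    # unref_dirty_nif/1 is hypothetical: a dirty-CPU-scheduler NIF that
    # performs the g_object_unref "on behalf of" the resource destructor
    :ok = unref_dirty_nif(resource)
    {:noreply, state}
  end

  defp unref_dirty_nif(_resource), do: :erlang.nif_error(:nif_not_loaded)
end
```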
Best wishes!

@akash-akya
Owner

akash-akya commented Dec 17, 2024

Hi @varsill,

I tried the script you shared, but I couldn't reproduce the behavior; it might be tough to track down the exact issue.

I am trying to understand the issue/concern better, so please help me out by confirming whether these assumptions are correct:

CPU Utilization

Is the main issue here that we are not able to utilize all of the available CPU on the host system? When I tested locally, it always used all of the available CPU. High scheduler usage on its own does not imply there is an issue, because the image processing operations themselves might be expensive (maybe we don't want libvips to use everything, but that is a different question).

You might know this already, but all libvips operations happen in separate threads, so you won't see anything in the thread trace; only the unref call happens within the scheduler, and it is a recursive call. For example, <VipsArray>[image1, image2] recursively unrefs nested elements. Unless there is some leak or funny logic, unref on its own is not an issue. I did try to find issues but wasn't able to find anything. It seems the script creates a lot of intermediate objects (which can be optimized to some extent), and that leads to a lot of unref calls, but there is nothing unusual about that.

NIF code in Normal Scheduler

From the first screenshot you shared, it seems you occasionally see run_resource_dtor in the normal scheduler, which can cause issues. I couldn't reproduce this, but I am guessing the total number of concurrent operations exceeds the number of available dirty IO schedulers, so the work is leaking onto normal schedulers? It might also be because of other load on the VM (IO operations). What is the max number of concurrent libvips operations you are running at a time? When this happens, what is the host system's CPU usage? Maybe share the VM details as well -- dirty schedulers, online schedulers, etc. ($ elixir --version)?

Thanks for starting the thread! But at this point I am not clear enough about the root cause to attempt a solution.

If you are interested, I can attempt to optimize the script you shared. Usually libvips works better when operations can be merged. If we do a lot of intermediate memory writes, it won't be able to do that (similar to Enum vs Stream).
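
The Enum vs Stream analogy in Elixir terms: Enum materializes every intermediate list, while Stream fuses the steps into a single pass, which is roughly what libvips can do when we don't force intermediate memory writes:

```elixir
range = 1..1_000_000

# Eager: each step materializes a full intermediate list
range |> Enum.map(&(&1 * 2)) |> Enum.filter(&(rem(&1, 3) == 0)) |> Enum.sum()

# Lazy: the steps are fused and run in a single pass, no intermediates
range |> Stream.map(&(&1 * 2)) |> Stream.filter(&(rem(&1, 3) == 0)) |> Enum.sum()
```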

@varsill
Author

varsill commented Dec 18, 2024

Hello @akash-akya , thanks for the answer and explanation!

  1. Concerning "CPU Utilization": The original problem is the high CPU usage of running the GObject destructor (https://github.com/akash-akya/vix/blob/c9d8b8079d7889524b7399d832f09ca8ae58974b/c_src/g_object/g_object.c#L48C13-L48C26) on scheduler nr 1, which occurs when libvips spawns many (~60) worker threads. Originally I thought that such high CPU consumption by destructors is expected, and that the problem is that destructors are always run on a single scheduler (which means they are also run on a single thread), preventing me from scaling further. Later, however, I came to the conclusion that the high CPU consumption of the destructors is probably a problem in itself. I believe that with "too many" libvips worker threads, the synchronization that needs to be done in the destructor takes the majority of the time. When I limit the number of libvips worker threads (with the VIPS_MAX_THREADS env variable), I see that destructors don't use that much CPU time, but unfortunately I am then not able to scale the number of concurrent libvips operations further, as at some point I start receiving Failed to write extracted region to memory errors raised from the write_to_binary function.

  2. Concerning "NIF code in Normal Scheduler":

I couldn't reproduce this, but I am guessing if the total concurrent operations exceeding the number of available dirty IO schedulers so it is getting leaked to normal schedulers?

Oh, I am not sure it works like that - as stated here: https://erlangforums.com/t/how-to-deal-with-destructors-that-can-take-a-while-to-run-and-possibly-block-the-scheduler/4290/3: "if the resource destructor is triggered on a normal scheduler, it will run on that same scheduler. Otherwise, if it is triggered on a dirty scheduler, it will be rescheduled to run on the first scheduler."

What is the max concurrent libvips operations you are running at a time?

The lowest number of concurrent libvips operations for which I was able to spot the excessive CPU usage was 5.

When this happen, what is the Host system CPU usage?

Well, the CPU usage was really high - it was eating up 1000% of CPU. When I reduced the number of libvips worker threads to 8, the CPU usage dropped to 300%, and what is more, the "throughput" (the number of images I was able to compose in a fixed period of time) increased.

Maybe know the VM details -- dirty schedulers, online schedulers etc.

I used the configuration with 12 online schedulers (and the same number of dirty schedulers).

If you are interested, I can attempt to optimize the script you shared. Usually libvips works better when operations can be merged. If we do lot of intermediate memory writes then it won't be able to do it (similar to Enum vs Stream)

Oh, that would be great if you could provide me with some hints on how to merge those operations!

@akash-akya
Owner

akash-akya commented Dec 20, 2024

@varsill thanks for sharing the details.

When I limit the number of libvips worker threads (with VIPS_MAX_THREADS env) I see that destructors are not using that much of CPU time,

When I reduced the number of libvips worker threads to 8, the CPU usage was reduced to 300% and what is more the "throughput" increased

Understood. This is definitely odd and looks like an issue - probably a coordination issue, like you mentioned. I'll investigate further to reproduce and fix it. Any pointers on reproducing this would be helpful :)


I'll try to optimize the operations and see how far I can go

@varsill
Author

varsill commented Dec 20, 2024

@akash-akya thank you very much!

Any pointers on reproducing this would be helpful :)

I believe the crucial thing is to ensure that the resource destructor is triggered from a dirty NIF scheduler. When I run the script I provided a couple of answers above, I see that the destructors are run on normal schedulers (which implies that they were triggered from those normal schedulers), and I think this is why the problem is not reproduced.
