
feature/mulithreading-changes #168

Merged: 11 commits merged into master on Mar 3, 2024

Conversation

ndortega
Member

found a 10 - 15% performance gain by removing the current queue + multithreading approach and leaning harder on the scheduler built into Julia using a cooperative multitasking strategy.

Additionally, the amount of code to get this working allowed me to completely remove the StreamUtil module, some state variables, and type definitions in favor of a single function - which makes it much easier to maintain.
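
For readers following along, here is a minimal, illustrative sketch of the general pattern (not the actual Oxygen internals): each request handler is spawned as a task and left to Julia's cooperative scheduler instead of being pushed through a custom queue.

```julia
using HTTP

# Illustrative sketch only: wrap a handler so each request runs as a task that
# Julia's scheduler is free to place on any available thread.
function parallel_handler(handler)
    return function (req::HTTP.Request)
        task = Threads.@spawn handler(req)   # hand the work to the scheduler
        return fetch(task)                   # wait for the result while yielding the thread
    end
end
```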


codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 98.11321%, with 1 line in your changes missing coverage. Please review.

Project coverage is 99.20%. Comparing base (73a8f0a) to head (40700e8).
Report is 1 commit behind head on master.

Files         Patch %   Missing Lines
src/core.jl   97.43%    1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #168      +/-   ##
==========================================
+ Coverage   97.49%   99.20%   +1.71%     
==========================================
  Files          20       19       -1     
  Lines        1157     1138      -19     
==========================================
+ Hits         1128     1129       +1     
+ Misses         29        9      -20     

☔ View full report in Codecov by Sentry.

@ndortega
Member Author

Hi @JanisErdmanis,

I have tested both approaches locally on my laptop with Julia 1.6 and 1.10 and found either the same or better performance with these changes. The results of this test almost feel too good to be true, which makes me a bit hesitant.

What do you think?

@JanisErdmanis
Contributor

JanisErdmanis commented Mar 2, 2024

Hi @ndortega

Sorry, I did not see the message. I am unfamiliar with the parallelism internals and don't yet use them; thus, I can't comment on the performance. But I am happy that there is such a feature and may find it helpful in the future. For now, I can only offer some ideas to consider:

  • serve and serveparallel have very similar functional structures and almost the same kwargs. Perhaps it would be wise to combine them into a single serve where parallelism can be enabled with a keyword argument (see the sketch after this list).
  • It looks strange and cumbersome to have so many keyword arguments to startserver, to which a closure is also passed. I would suggest putting the startserver function body into serve.
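
A rough sketch of what that combined entry point could look like, purely to illustrate the suggestion. The serve signature, the parallel kwarg, and the build_router/parallel_handler helpers here are invented placeholders, not the actual Oxygen.jl API.

```julia
using HTTP

build_router() = req -> HTTP.Response(200, "ok")                     # placeholder for Oxygen's router setup
parallel_handler(handler) = req -> fetch(Threads.@spawn handler(req)) # run each request on a spawned task

# Hypothetical combined entry point: parallelism is opted into with a keyword argument.
function serve(; host = "127.0.0.1", port = 8080, parallel::Bool = false, kwargs...)
    handler = build_router()
    if parallel
        handler = parallel_handler(handler)
    end
    # startserver's body could live here instead of being a separate closure-taking helper
    return HTTP.serve(handler, host, port; kwargs...)
end
```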

I am rather busy and can't help with coding.

EDIT: I think I am able to grasp how parallelism is implemented now. I find it concerning that there is shared mutable state between threads, introduced by the metrics history object. A simple fix would be to create a history object for each thread and then aggregate those into a single history object, which can be shown on the metrics page. But perhaps that is unnecessary for CircularDeque.
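
To make the idea concrete, a small sketch of per-thread histories. The names and container types are illustrative; the real implementation uses a CircularDeque and richer records.

```julia
# One history buffer per thread: each thread appends only to its own buffer,
# so the hot path needs no lock; the metrics page aggregates on demand.
const THREAD_HISTORIES = [NamedTuple[] for _ in 1:Threads.nthreads()]

record!(entry) = push!(THREAD_HISTORIES[Threads.threadid()], entry)

aggregate_history() = reduce(vcat, THREAD_HISTORIES)
```

One caveat: on newer Julia versions tasks can migrate between threads, so threadid()-indexed buffers would need the recording task to be pinned or otherwise guarded in a real implementation.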

@ndortega
Member Author

ndortega commented Mar 3, 2024

Hi @JanisErdmanis,

I'm glad you brought up that point about the shared History object. I've used a ReentrantLock to make writing to and reading from the history object thread-safe. That being said, I'll need to look into memoizing the metrics outputs, since these calculations can take a while when there are hundreds of thousands of requests and we call the metrics endpoint on an interval.
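
Roughly, the lock-based approach looks like this (a simplified sketch, with a plain vector standing in for the actual history container):

```julia
const HISTORY = NamedTuple[]          # stand-in for the shared history object
const HISTORY_LOCK = ReentrantLock()

# Writers (one per request) and the reader (the metrics endpoint) both take the lock.
function push_history!(entry)
    lock(HISTORY_LOCK) do
        push!(HISTORY, entry)
    end
end

function read_history()
    lock(HISTORY_LOCK) do
        copy(HISTORY)   # copy out under the lock so metrics can be computed lock-free
    end
end
```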

And I agree with your second point: with these latest changes the functions are basically identical to one another and could be merged into a single one.

@ndortega ndortega merged commit db0852d into master Mar 3, 2024
8 checks passed
@JanisErdmanis
Contributor

That being said, I'll need to look into memoizing the metrics outputs, since these calculations can take a while when there are hundreds of thousands of requests and we call the metrics endpoint on an interval

You could look into implementing a function merge!(history, thread_1_history), which perhaps can be made fast. I don't think there should be concern about making the metrics dashboard fast yet, as this could complicate matters.
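
One possible shape for that helper, assuming each per-thread history is a plain vector of records (the real Oxygen types may differ); the name merge_histories! is used here just to keep the example self-contained:

```julia
# Illustrative only: bulk-merge one thread's history into the shared one.
function merge_histories!(history::Vector, thread_history::Vector)
    append!(history, thread_history)   # one bulk copy instead of per-request locking
    empty!(thread_history)             # reset the per-thread buffer for reuse
    return history
end
```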

An alternative to using a ReentrantLock, which may imply switching between threads, is to run the metrics on a single thread only and do the parallelization after that. Right now your handler is schematically parallelizer(metrics(router)), but it could be beneficial to rewrite it as metrics(parallelizer(router)): then you would no longer need a ReentrantLock, which should be faster since it does not need synchronization between tasks.
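
A toy illustration of the two orderings, with every name here a stand-in rather than an Oxygen internal:

```julia
router(req) = "response for $req"

history = String[]
metrics(handler) = req -> (push!(history, req); handler(req))        # records on the calling task
parallelizer(handler) = req -> fetch(Threads.@spawn handler(req))    # runs the inner handler on a spawned task

# Current ordering: the push! happens inside the spawned task, so concurrent
# requests need a lock around `history`.
handler_current = parallelizer(metrics(router))

# Proposed ordering: the push! happens before the work is handed off, so the
# worker tasks never touch `history`.
handler_proposed = metrics(parallelizer(router))
```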

@ndortega
Member Author

ndortega commented Mar 3, 2024

@JanisErdmanis

I liked the idea of dropping the ReentrantLock and writing it like metrics(parallelizer(router)). I refactored the code to not use any locks and to only exclude the docs and metrics middleware from getting wrapped with the parallel handler.

Now that the main thread is used exclusively for the docs & metrics middleware, I had to make the metrics endpoint asynchronous in order to prevent the app from halting when viewing the metrics under heavy load. Even after these changes, I was surprised to see a 2x slowdown in RPS on my laptop between this approach and the ReentrantLock approach.
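
For context, "making the metrics endpoint asynchronous" could look roughly like this; an illustrative sketch only, not the code on the experimental branch, with calculate_metrics and history as trivial placeholders:

```julia
history = NamedTuple[]                                   # placeholder for the shared history object
calculate_metrics(h) = (total_requests = length(h),)     # placeholder for the real (slow) calculation

# Run the potentially slow metrics summary on a worker task so the accepting
# thread keeps serving other requests while it runs.
function metrics_endpoint(req)
    task = Threads.@spawn calculate_metrics(history)
    return fetch(task)
end
```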

If you're curious and want to view this code, let me know and I'll push that branch.

@JanisErdmanis
Contributor

I was surprised to see a 2x slowdown

This is unexpected. I wonder if it could be due to tasks also being scheduled on the main thread, which could slow it down. Push the code to the branch and I may have a look.

@ndortega
Member Author

ndortega commented Mar 4, 2024

Sure thing, here's the branch I was working out of: https://github.com/OxygenFramework/Oxygen.jl/tree/feature/multi-threading-experimental-changes

@JanisErdmanis
Contributor

JanisErdmanis commented Mar 4, 2024

Yeah. I suspect that the performance drop is due to tasks being spawned on the primary thread. I think the additional computation for the metrics may have affected some fairness policy in the scheduler, so the main thread ends up hosting other threads' tasks.

You may find it interesting to explore ThreadPools.jl.

EDIT: The issue is that the metrics middleware waits for the response (response = handler(req)) and is thus blocked from accepting new requests.
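
Schematically, the blocking shape being described is something like the following; the history container and the response.status field are assumptions for illustration, not Oxygen's actual middleware:

```julia
history = NamedTuple[]   # placeholder for the shared history object

function metrics_middleware(handler)
    return function (req)
        start = time()
        response = handler(req)   # the accepting task waits here until the worker finishes
        push!(history, (duration = time() - start, status = response.status))
        return response
    end
end
```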

@ndortega ndortega deleted the feature/mulithreading-changes branch March 8, 2024 02:25
@ndortega
Member Author

ndortega commented Mar 8, 2024

Exactly, it won't be worth making this refactor until I speed up the metrics calculations. This shouldn't be too difficult since the helper utils are all very similar.

Out of the 7 metrics returned:

  • 3 of these metrics iterate over the entire history
  • 4 of these metrics iterate over the same filtered part of the history

Possible Next Steps:

  • Break these metrics into two groups, with one loop per group
  • Combine the two groups so a single loop calculates all of these metrics (roughly sketched below)
  • Implement a memo table to store previous calculations; if done well, I may not even need either of the points above
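
As a rough illustration of the single-loop idea: one pass over the history computes both the whole-history aggregates and the aggregates over the filtered portion. Field names like latency, status, and timestamp are assumptions here, and these are not the actual 7 metrics.

```julia
function calculate_metrics(history, window_start)
    total = 0; errors = 0; total_latency = 0.0
    recent = 0; recent_errors = 0; recent_latency = 0.0
    for entry in history
        total += 1
        total_latency += entry.latency
        entry.status >= 500 && (errors += 1)
        if entry.timestamp >= window_start            # the filtered portion of the history
            recent += 1
            recent_latency += entry.latency
            entry.status >= 500 && (recent_errors += 1)
        end
    end
    return (
        total_requests     = total,
        error_rate         = total  == 0 ? 0.0 : errors / total,
        avg_latency        = total  == 0 ? 0.0 : total_latency / total,
        recent_requests    = recent,
        recent_error_rate  = recent == 0 ? 0.0 : recent_errors / recent,
        recent_avg_latency = recent == 0 ? 0.0 : recent_latency / recent,
    )
end
```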

@JanisErdmanis
Contributor

Perhaps you could also consider using a channel into which requests are put by the main thread and then processed by individual threads, each returning an HTTPResponse object on another channel that is then routed back to users. Though I don't know how to do the latter. The benefit of this approach is that it would delegate the task scheduling to Julia. But then again, the throughput depends on a single history-recording object and on acquiring a lock for the channel...
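
A very rough sketch of the channel idea, leaving aside the part about routing responses back to clients; the handle function and the use of Any-typed channels are placeholders:

```julia
requests  = Channel{Any}(1024)   # the main thread puts incoming requests here
responses = Channel{Any}(1024)   # workers put finished responses here

handle(req) = "response for $req"   # stands in for the wrapped router

# One worker task per thread; Julia's scheduler decides where they run.
for _ in 1:Threads.nthreads()
    Threads.@spawn for req in requests
        put!(responses, handle(req))
    end
end
```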

I think the design decision here needs to be supported by benchmarks. One way is to check how many requests per second it takes to saturate HTTP without metrics and with metrics. If the saturation point with metrics is sufficient, then the history can be shared by every thread and used together with a lock as it is now. Otherwise, it needs to be put within each system thread's event loop to improve throughput.
